Proceedings of Extreme Markup Languages®

Generalizing XPath for directed graphs

Abstract

XPath is a very natural and powerful way to specify
locations in XML documents. This paper examines possible
generalizations of XPath to allow both locations and paths
through generalized labeled directed graphs to be specified. The
need for such a path language is driven by work in querying
Linguistic Annotations which are in general more complex in
structure than XML documents. The result of this exercise is a
powerful path language which reduces to XPath as a special case
and which could potentially be useful in a range of query
applications.

Steve Cassidy

Steve Cassidy is a Computer Scientist who has worked in
various areas relating to language and cognition over the last
15 years. He completed a Ph.D. in Wellington, New Zealand on
computer models of reading development and then moved to
Macquarie University, Sydney to work at the SHLRC [Speech Hearing and
Language Research Centre]. At SHLRC he worked on
applying statistical models to acoustic phonetics problems and
on the development of the Emu Speech Database System. His work
on Emu has led to an involvement with groups in the US and
Europe who are aiming to define standards for Linguistic
annotation. Steve is now working at the Centre for Language
Technology, part of the Computing Department at Macquarie where
he is persuing research in annotation and speech and language
technology.

Generalizing XPath for directed graphs

Extreme Markup Languages 2003® (Montréal, Québec)

Background

This paper describes a generalization of the XPath query
language motivated by the design of a query language for
Annotated language corpora.

Annotated corpora

Research in Linguistics and Language Technology relies
heavily on collections of language data called corpora [Church & Mercer 93]. These can consist of text
documents (written texts or transcriptions of speech), audio,
video or other time series data (for example, physiological
measurements of speech articulation). A common requirement in
most fields of research is to apply some kind of annotation to
the corpus data. This annotation is often derived manually and
encodes additional information about the linguistic data
ranging from simple part of speech tags for words to complex
rhetorical structures in dialogue.

A particular feature of linguistic annotations is that
they often involve more than one hierarchy over the same set of
tokens. For example, Figure 1
shows part of a sentence which has been annotated syntactically
and intonationally resulting in two hierarchies dominating a
single set of word nodes.

Figure 1

An example Emu annotation showing an utterance from
the TIMIT database [TIMIT] which has been
augmented with both a syntactic and intonational
annotation. The names on the left show the types of each
set of nodes. The Word nodes have been duplicated to show
the links to both the syntactic and intonational
hierarchies. For clarity, sequence relationships are
implicit in the left to right ordering of nodes in this
diagram.

Recent work has attempted to develop a uniform data model
for manipulation and interchange of annotated corpus data. In
one model, annotations are viewed as directed graphs [Cassidy 01] where nodes represents tokens in
the corpus. Nodes may be labeled with various attributes
including a type (phoneme, word, paragraph), start and end
times, and one or more labels. The edges in this graph can have
one of three types: a parent/child relationship defining one or
more hierarchies, an explicit sequence relationship, and a
general associative relationship. This data model is capable
of representing annotations such as the one in Figure 1, as well as many other styles of
annotation. The optionality of timing data means that the model
is also a good fit to pure textual data.

An alternate data model, called Annotation Graphs, has
been proposed [Bird & Liberman 01] which again
represents annotations as a directed graph but foregrounds the
sequential structure of annotations leaving hierarchical and
associative relations implicit. In an annotation graph (Figure
2), nodes are anchors which may
have associated timestamps and edges correspond to tokens in
the corpus and have one or more labels. Hierarchical relations
in annotation graphs are implicitly encoded by edges which span
multiple smaller edges. Associative relations can be encoded by
common labels in the same way as ID/IDREF is used in XML. While
Annotation graphs might seem very different from the Emu
hierarchical model illustrated above, we believe that they are
isomorphic [Cassidy 01], and in fact
the next version of the Emu toolkit will use annotation graphs
internally.

Figure 2

A simple Annotation graph where nodes represent
anchors (in this case, with associated times), and edges
represent tokens in the corpus (e.g., words and phrases). The
hierarchical relationship between the NP edge and the word
edges is implicit in this model.

Toolsets in various specialist domains exist to create
and make use of annotated corpora. Examples are the Emu system
http://emu.sourceforge.net for acoustic phonetic analysis,
GATE http://gate.ac.uk, and MATE [McKelvie et al. 01] for text based data. As part of some
of these tools, a query language is provided to find parts of
the corpus for study or analysis. For example, the Emu system
allows queries which retrieve speech segments according to
various criteria allowing subsequent numerical analysis of the
speech data. Following the development of the more general
data models described above, the need for a more general query
language becomes apparent; this paper describes some work aimed
at developing such a language.

Relation to XML

Clearly the annotation data model shows a close
resemblance to XML, and indeed if only one hierarchy needs to be
encoded in an annotation, it can be naturally serialized as
XML. Many existing Linguistic text corpora are encoded in XML,
and there have been large scale efforts to develop standardized
DTDs for corpus encoding [TEI]. The problem
of intersecting hierarchies has been addressed using, for
example, standoff markup [Thompson & McKelvie 97] where each
hierarchy is stored in a separate XML document, each referring
to regions in the original text using XPointer
references.

XML has been less frequently used for annotation of
speech data. It is used as an interchange format in some tools,
such as the Annotation Graph toolkit; however, in these cases
the structure of the XML document does not reflect the
hierarchical structure of the annotations.

Given this close relationship, an obvious question is
whether any of the various XML query languages are appropriate
for the task of querying annotations.

Annotation query use cases

In an earlier paper [Cassidy 02], we
presented a number of annotation query use cases and used them
to evaluate XQuery as an annotation query language. This
showed that while a number of the queries could be evaluated
within XQuery, some interesting cases were very difficult to
express. Here we present two examples which illustrate
differences between this application and the XML query
use-cases.

An
example answer to this query is the L- Intermediate level node
containing the words “your dark suit” in Figure 1. This query highlights
the need to be able to cope with multiple hierarchies since it
requires looking at both Intermediate and Syntax parents of the
Word nodes. While XML won’t allow multiple parents, there’s
nothing in XQuery (or in particular, XPath) which precludes a
traversal from parent to child to a different parent. If we
allow this, the query can be written in XQuery as shown in
Figure 3 (of course, the
result here would be an XML fragment which might not be what is
required; we’ll ignore this factor for the moment).

Find sequences of any number of consonant phonemes
followed by a single vowel
followed by any number of consonants: C+VC+.

This is a significant
use case since it illustrates an important class of query for
language corpora — finding non-annotated patterns based on ‘grammatical’ structure. At the same time, this kind of query
is very difficult to express in XQuery (although not
impossible, see [Cassidy 02] for a solution
using functions).

The evaluation of XQuery with respect to these and other
use-cases provides two interesting insights. Firstly, the
document or data centric views provided for by XQuery (and
other XML query languages) are almost, but not quite, appropriate
for annotation query. Secondly, XPath is a very clear and
natural way to specify locations of elements within annotations,
but would need to be extended to achieve completeness.

Outline

This paper reports on some work which explores the idea
of extending XPath in directions which make it more useful for
annotation query. In particular, it describes an attempt to
generalize XPath to address queries like the second one
presented above (C+VC+). The result of this exercise is a
generalized path-based query language which applies not only to
annotations, but also to generalized directed graphs, and which
reduces to XPath as a special case.

For the purposes of this paper, I’ll talk about a
particular kind of directed graph which is a generalization of
the graphs described for annotations. These graphs can be
described as edge labeled directed
graphs. An example is the XML graph structure which
contains edges labeled “child”, “attribute”, and a few
others. The nodes in these graphs have two labels: a type
(cf. the XML element name) and a textual label. This is an interesting class of graphs which subsumes, for example, XML,
RDF, and our annotation data.

The remainder of this paper reviews the literature on
path-based query languages and then describes the
generalization of XPath and evaluates this in the annotation
domain and in more general graph-based data.

Query language review

Annotation query

For the most part, the query languages proposed and
developed for annotation data have been of ad-hoc design and
are able to cope only with a subset of the kinds of annotation
data seen in practice. For example, the Emu query language
[Cassidy 01] provides a means of
finding nodes based on sequential and hierarchical relations
but lacks many basic query language features, such as
disjunction and compositionality. Most other annotation query
tools have made use of general purpose structural pattern
matching tools, such as Sgrep
http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html,
which allow for regular expression like searches in structured
textual data.

One important proposal that has been made recently
follows from the work on annotation structures by Bird and his
colleagues [Bird & Liberman 01]. [Bird et al. 00] propose a query language based on path
expressions in an annotation graphs. Queries are made up of a
set of path expressions which describe paths through the
annotation graph. For example, in the query term X.[:word, label:cat].Y, X and Y are
query variables and stand for nodes in the annotation graph, and
the terms in brackets specify the type (:word) and attribute values associated
with the edge between these nodes.

While this query language proposal has a number of
interesting features, it is awkward to express queries in.
Since it focuses on traversing only the sequential relations in
an annotation graph, expressing hierarchical queries is very
unnatural. Evaluation against a number of annotation query use cases
has shown [Cassidy et al. 00] that while it is
capable of expressing all of the use cases, the constructs
needed to do so are highly complex and would be difficult for
naive users to work with. Part of the effort in this paper is
to work with a very natural query language component (XPath)
and see if it can be extended for annotation query wile
retaining the naturalness of expression.

Path-based query languages

There is a considerable literature on path-based query
languages both as alternatives to relational query languages
and in more specialized domains. In particular, path
expressions have often been proposed in the context of querying
semi-structured data. Abiteboul [Abiteboul 97]
describes path expressions
as sequences of object attributes (e.g., guide.restaurant.address.zipcode) and
extended path expressions
as those which allow for wildcards and path variables
(e.g., myreport.#.title to
match any title attribute of an object reachable from
myreport). These are implemented in, for example, OQL-doc
[Christophides et al. 94], which extends an object-oriented
database query language with path expression.

The G+ path language [Mendelzon & Wood] and its
successor Graphlog [Consens & Mendelzon] provides a means
of specifying regular path expressions to match against paths
in edge labeled graphs. Queries in Graphlog are expressed
graphically as a graph where the edges can be labeled with
path regular expressions
which match sequences of edges in the graph being queried.
Graphlog queries can include closures over edges (e.g., follow zero or more child edges) and negation (e.g., there’s no child edge between a pair of nodes).

One feature missing from Graphlog is the ability to
specify conditions on the nodes traversed in a path. Since
Graphlog path regular expressions relate to sequences of edges,
they are able to describe arbitrary sequences of edge
traversals. The nodes traversed are not constrained in any way, so one can’t, for example, constrain each one to have a
certain property. It is possible to express this kind of query
by defining new relations via sub-queries (one would define a
relation sequence of C and
then look for the closure over that). This is one required
feature that can perhaps be addressed more concisely in our
generalization of XPath.

Graphlog provides a powerful an expressive query language
which is supported by a thorough analysis of the power of the
language and the complexity of query processing. As such it may
provide some theoretical basis to relate to the work described
here on generalizing XPath.

XPath

XPath (http://www.w3.org/TR/xpath) is a component of
a number of XML standards and is used to select specific nodes
in an XML document for subsequent processing. An XPath
expression is a mapping from a node (the context node) to the
set of all nodes reachable by the specified path. A path
expression is written as a series of steps where each step
defines the axis used to
reach new nodes and a node
test used to restrict the set of nodes reached
along the axis. In XPath, axes are defined as a finite set of
node-to-node mappings in the graph and reflect the structure of
XML documents; example axes are child, ancestor, attribute, and self. Node tests consist of two
parts: a restriction on the element name and an optional
predicate expression. The XPath standard also contains
additional features such as built in functions.

XPath was built as a pragmatic solution to the problem of
locating nodes in an XML graph. There is no underlying path
algebra that has been used to guide the development of the
language (although see [Wadler 00] for a
post-hoc analysis of the formal semantics of the XPath 1.0 and
http://www.w3.org/TR/xquery-semantics/ for the official
version for the XPath 2.0 proposal). Features have been added
to make it work in the context of a number of W3C
recommendations. In particular, the inclusion of function
definitions in the XPath 2.0 (http://www.w3.org/TR/xpath20)
proposal provides arbitrary computational power to the language
but arguably muddies the idea of pure path expressions. The
goal of this paper is to try to add some expressive power while
keeping to the core concept of path expressions.

Generalizing XPath

XPath limitations

The widespread adoption of XPath is in part due to the
naturalness of expressing node locations as paths from a known
node in the graph. The abbreviated XPath syntax is very
familiar to any web developer familiar with the URI syntax or
unix pathnames. There are, however, a number of arbitrary
limitations inherent in XPath which, if relaxed, might make
result in a path language applicable to a wider range of data,
in particular to annotated corpora.

From the use case analysis of XQuery, there are a number
of restrictions in XPath which could be relaxed in order to
make it a more general query language component. These are
listed here with some commentary on the motivation for
each.

The result of evaluating a path expression is a
node set. In some applications, it would be useful to be able
to retrieve the path that was traversed. Hence, the result
of a path expression would be a path set.

Limited, non-extensible set of axes.

Inability to state conditions on paths as opposed to nodes. This
extends to the notion of defining regular expressions on
paths.

Returning path sets

While locating nodes in a graph is the sole use of
XPath, there are cases in annotation query and perhaps in
other applications where selecting paths through the graph is
important. The prototypical example from annotation query is
find sequences of phonemes with
property X. Note that no changes to the path
language are needed to achieve this, but that the query
processor needs to be augmented to maintain path sets rather than node
sets.

Defining axes

The set of axes defined in XPath is clearly designed to
allow the set of graph traversal operations that are seen to
be atomic in XML document trees. In a more general graph
structure, it is clearly more difficult to foresee what axes
will be needed, and in general it is more useful to define
what an axis is than to prescribe a particular set of axes.

An XPath axis is fundamentally a mapping from nodes to
node sets that defines a way of traversing the underlying
graph. Each axis encapsulates two things: a type of edge to
follow (e.g., child vs. attribute) and whether to follow it
transitively (e.g., child vs. descendant).

One approach to extending the axis set is to allow
arbitrary definitions of axes as mappings from nodes to node
sets. If we assume that the graph structure being searched is
the edge labeled directed graph described earlier, then any
edge label might be associated with an axis. Axes might also
be defined for more complex traversal — an example might be
an uncle axis which
returns siblings of a node’s parent using a combination of
parent and child edges. Similarly, for an XML document one
might define an IDREF
axis which followed IDREF links.

This approach would allow more control of
transitivity in the axis definition as is the case with the
descendant axis in
XPath. However, a more general solution is to factor this out
as a modifiable parameter in a path expression. The current
binary single step/transitive distinction can be generalized
to a how many steps
parameter which controls how many steps along an axis are
allowable. For example, I could find my grandparents by
stepping along the parent axis exactly twice; I could find
all of my children and grandchildren by stepping along the
child axis up to two
times.

The proposal here is to define an axis as a function
mapping nodes to pathsets; the function takes a parameter
denoting how many steps to take along the axis in generating
new paths.

Path conditions

In XPath, conditions apply only to the last node on the
path extended from the context node; for example, the
expression /descendant:para[@style='important'] traverses
along any number of child edges to find paragraphs with a
particular attribute. No conditions are placed on any of the
intermediate nodes along the axis.

Rather than limiting conditions to the final element of
a path, the approach taken here is to allow conditions on any
subset of the nodes traversed while extending the path. For
example, a condition might apply to all nodes, the first node,
or the last two nodes.

Introducing this generalization allows the use-case
described earlier to be addressed. A sequence of nodes
labeled C can be located by walking along the following axis an unlimited number
of steps applying the condition to every node.

Extend programs

In order to generalize XPath, it is useful to first reduce
it to a minimal form; one way is to view an XPath expression as
a sequence of function applications each mapping a node set to a
node set. Each call finds new nodes along some axis, possibly
applying some node tests. Note that this abstraction ignores
XPath features, such as sequence expressions and function calls;
these fall outside the core path language and might well be
added back in when the generalized path language is defined.

The generalized language can be defined by defining
similar function calls in the following manner. Each path step
is a call to an extend
function which maps path sets to path sets. The signature of
extend is as follows:

extend [axis] [step] [condition] [where]

where [axis] is the name of
the axis to traverse, [step] is a restriction on the number
of steps taken along the axis, [condition] is a boolean condition on
nodes and [where] is a
restriction on where the condition must hold. These are
discussed in detail below.

Axis

The [axis] argument
names an axis which must define a mapping from nodes to
nodesets in the graph. Each axis need only define what it
means to take a single step from a node. We foresee a
framework providing for definition of new axes on
particular graphs in terms of a base set of axes or via a
general purpose programming language.

Step

The [step] argument
determines how man steps should be taken along the axis and
takes the form of a list of positive integers or the special
value inf. The
interpretation of this list is that a path length is allowed
if included in the list. The special value of inf means that any length of path is
valid. If the list includes integers and inf, then any path of length greater
than the highest integer is valid. For example, [2,3,4] would allow paths of length
2, 3 or 4 but not of length 1 or 5, [1,10,inf] would allow paths of
length 1 or longer than 10. It is expected that syntactic
sugar would be introduced to specify ranges of integers,
etc. Note that when the only integer specified is zero, the
path will not be extended at all; this only makes sense in
combination with where =
0, see below for a discussion.

Condition

The [condition]
argument is intended to be a general boolean condition on
nodes in the same sense as XPath conditions. This would
include recursive calls to the path language and any
conditions on node labels that are appropriate to the
application. The exact form of these conditions is not
specified here; in the trial implementation, we have
implemented a general callout to the host programming
language (Tcl). In the examples below, the conditions are
expressed as a set of tests on the value of labels. As in
XPath, the condition operates on a context node; unlike XPath,
the condition may be applied to many nodes along a
path.

Where

The [where]
argument specifies where on the path the condition must hold.
As with [step], this is a
list of positive integers plus the special values of
inf and end. An integer in the list means
that the condition is applied to that node on the path. The
special value inf means
that the condition must hold for all nodes in the path. The
special value end means
that the condition applies to the end of the path only. For
example, the list [1,2,end] would find paths where the
first, second and last nodes satisfied the condition.

An important special case is when the where argument is includes a zero. Path positions are indexed from one so that the first
node added to a path has an index one. The zero index refers to
the context node — that is the start node for the path
extension. If where
includes zero, then the condition must hold on the context
node with the result that all existing paths whose last node
does not meet the condition will be pruned. In XPath, the
same effect is achieved via a special axis called self; the mechanism proposed here
removes the need for this special axis. In combination with a
zero step argument, this can result in a pure prune operation
where no new paths are extended, but any existing paths not
fulfilling the condition are pruned.

Initial conditions

In XPath, there is a unique starting node from which an
XPath expression is evaluated. This is either a node defined
by external context or the root of the XML document. Extend
programs could maintain the idea of an externally provided
node (or node set or path set), but since the underlying data
model is a generalized directed graph, there is no equivalent
to the document root. Two possible starting conditions seem
plausible: start with the empty node set, or start with the set
of all nodes. Starting from the empty set requires some magic
behavior for each axis: walking from the empty set along any
axis leads to the set of all nodes. Given this, it seems best
to assume the full node set as a starting point for an extend
program which isn’t given external context.

Example extend programs

To illustrate the ideas described above, here are two
example extend programs which implement XPath and
annotation query use-cases. In these examples, a global path
set is assumed, but in practice this would be passed between
calls to the extend function.

//book/chapter/following-sibling::chapter.
Using the child and following-sibling axes and assuming that
each node has a element
label corresponding to the element name:

In this example, the first instruction finds book nodes along
the child axis and is allowed to make any number of steps;
only the final node in the path needs to be a book. The
second instruction finds chapter nodes one step away. The
third instruction makes any number of steps along the
following-sibling axis to find another chapter node; again,
only the end node is tested. The result of this program would
be a set of paths; to generate the equivalent set of nodes as
the XPath expression, the end nodes of each path can be
taken.

Find sequences of any number
of consonant phonemes followed by a single vowel followed by
any number of consonants: C+VC+. Assuming a
following axis which
connects nodes in sequence, the extend program is:

Here, the first instruction extends the start set of all
nodes by between zero and infinity steps ensuring that all
nodes, including the start node, are labeled C. The second
instruction extends by exactly one step to a node labeled
V. The third instruction is similar to the first, but doesn’t
constrain the zeroth node (which we know is a V).

Syntax

While extend programs might provide the required
expressive power, it is clearly not suitable for end users, and
a more natural query syntax is desirable. An obvious choice
is to stay close to the XPath syntax and add new elements for
the step and where parameters. One possible syntax would be:

/axis{step}::condition::{where}

The step and where components would default to their XPath
values (step = 1 and
where = end). The
condition component would be a literal name (which would
match against some application defined property of the
node) with an optional node predicate in square brackets as
per XPath. Using this syntax, the second example above
would be written:

In practice, it is likely that an abbreviated syntax
might be defined in particular application areas; for
example, the most common axis (such as child in XML) might be omitted, and
short forms might be introduced for common node tests.

Applications of generalized XPath

This section illustrates potential applications of the
generalized path language in a number of domains. These examples
are necessarily contrived since the focus of this work has been
on annotation query. However, they serve to illustrate the
properties of the path language in a more familiar domain for
most readers. The point of the examples is not “here’s something I can do that no other QL
can” but merely to show that in some possibly
interesting use cases, the generalized path language provides a
clear way of expressing the desired result.

RDF

The RDF [Resource Description Framework]
http://www.w3.org/RDF/ defines a data model for
metadata consisting of a directed graph of nodes representing
resources or values and edges representing relationships. As
such, RDF graphs provide an interesting test case for the new
path language.

RDF graphs are constructed from a set of triples of the
form [subject, predicate,
object], for example, [http://www.ics.mq.edu.au/~cassidy/, creator,
Steve]. The subject is also referred to as a
resource and is usually
written as a URI. The object can be either a literal value or a
resource URI which itself might be the subject of another
triple.

The primary feature of the path language of use in RDF
is the ability to define new axes based on the properties of
the graph. While there are a number of ways of doing this,
one obvious mapping is to use predicate names (edge labels) as
axes names. This gives a potentially limitless set of axes,
but an application might restrict the set via a schema and,
hence, allow more efficient evaluation of path expressions. In
addition, an application might define compound axes based on
traversal of multiple edges.

Other features in the path language — the extraction of
paths and application of tests to whole paths — are less
obviously applicable here. Returning whole paths would allow
one to use the path language to generate subgraphs of the
original RDF graph which might be useful as a query language
component.

Node tests for RDF graphs could be simple comparisons
with the node label — that is the resource name or literal
value. As an alternative, the special status of the rdf:type might be recognized in the
path language to allow selection of nodes based on type. So,
the proposed path syntax for locating nodes in RDF graphs is:

predicate{step}::TypeOrLiteral[predicate]::{where}

where TypeOrLiteral is treated as a disjunctive comparison with
the node label or the rdf:type of the node. predicate is a path expression which
is evaluated with the context node set to the current node
being matched.

RDF Calendar Query

Figure 4

A fragment of an RDF graph corresponding to an
event with one attendee.

Libby Miller
(http://swordfish.rdfweb.org/rdfquery/) provides a number
of RDF query use cases illustrating the capabilities of the
Squish RDF query language. One interesting case involves
querying a collection of events (RDF icalendar) and
associated information about individuals (RDF foaf). The
graph in Figure 4 shows a small
fragment of this kind of data. One example query
(http://swordfish.rdfweb.org/discovery/2001/07/swws/) is
"Get me details of all the events
that Hendler is attending".

The key to this query is to locate just those events
which have someone named Hendler as an attendee.
This can be expressed as a path expression:

Clearly, these components would need to be integrated into some processing framework, such as that provided by XQuery or XSLT.
It would not be too much of a stretch to imagine an XSLT-like stylesheet language for deriving XML serializations of RDF graphs.

Finding friends

As a simple example of how the step feature might be used with RDF,
consider the problem of finding friends within a FOAF [Friend
of a Friend] RDF datastore. For example, if I want to throw a
party and invite all my friends and their friends and their
friends I could use the path expression:

to find the names of all of these people. The step restriction {1,3} will allow traversal of between 1 and three foaf:knows edges. The names of the people we want to invite are the final nodes on all the paths returned from the above path expression.

XML query

While XPath does a good job of selecting elements from
XML documents, there are a few use cases that might benefit
from the additional features described here.

Building annotations

When dealing with legacy annotations, it is often
useful to be able to add new structure to a document marked up under an older schema. In some cases, this means
adding new parent elements to group a sequence of
elements. An example here might be a book which has been
marked up as a series of chapters, each with a type attribute
indicating its status in the book:

A document processor could then build new structure around the element sequences returned by the path expression.

Implementation

We have a trial implementation of an extend program
interpreter written in Tcl which can operate on arbitrary graph
data. Demonstrations have been build for XML and generalized
graphs, and we are working on an implementation for Annotation
Graphs. Currently, no parser for the syntax shown above has been
written as there are still many questions about how this will
work in general.

In our implementation, a new graph type can be added to
the system by defining a single class which provides an
interface to the underlying graph. This class defines methods
corresponding to the axes defined on the graph, each of which
takes a node and returns the node set found by traversing the
axis in the underlying graph. It is relatively straightforward
then to implement a class as a wrapper around an existing graph
data model such as XML.

The algorithm used to process extend programs is a simple
naive one which doesn’t try to optimize the process in any
way. The main process is to build up the path set by calling the
axis methods of the underlying graph following the step and
where parameters passed to extend. The Tcl extend function takes
an additional argument of a path set so that subsequent calls to
extend can build on earlier ones. Here is a sample extend
program which implements the C+VC+ query from earlier in the paper:

It is worthwhile acknowledging that an efficient
implementation of this language might be very difficult given
its generality. However, we’ve not yet looked at this aspect of
the problem. It may be that restrictions need to be placed on
the language to enable efficient implementation.

Discussion

In conclusion, XPath can be generalized to a new path
language for generalized edge labeled directed graphs, which
provides a very consistent set of semantics for axes and path
extension. This new language reduces to XPath as a special case
where only single steps are taken along an axis and conditions
are only applied to nodes at the end point of the path. We
believe that this is an interesting generalization which might
form the basis of a powerful and natural query language for
linguistic annotations and which might find application in other
areas.

Acknowledgments

I’d like to thank Dan Steffan for the implementation of the
prototype extend program interpreter and discussions on the graph
searching problem.