XML Path Language (XPath) and its functional implementation SXPath

Kirill Lisovsky, Dmitry Lizorkin

Institute for System Programming RAS, Moscow State University

XPath is a language for addressing structural parts of an XML document. This paper gives an overview of XPath and discusses its application for digital libraries.

SXPath is a compliant XPath implementation in the functional programming language Scheme. SXPath is based on the SXML data model -- the XML Information Set represented in the form of S-expressions. The SXPath design considered in this paper illustrates the suitability of functional methods for XPath implementation and the virtually unlimited capabilities of the SXPath/Scheme combination. SXPath may be used as a query language for an XML-based digital library.

XML Path Language (XPath) [1] is a language for addressing parts of an XML document [2]. XPath is the key language in the stack of XML technologies, and is used by many important XML languages, in particular XSLT, XPointer and XQuery.

In our previous article [3] we discussed how Electronic Library resources can be described with Dublin Core and represented as an RDF/XML document. Figure 1 recalls such a description, and we refer to it throughout this article. To make use of these resource descriptions, an application needs a tool for retrieving the necessary information from an XML document. For example, to find a relevant resource in the Electronic Library, an application will probably want to address some fields in the resource description in order to analyze them against a search request. Since fields in the RDF/XML description are parts of an XML document, XPath, discussed in this article, is a primary tool for such addressing.

The purpose of XPath is to address parts of an XML document. XPath treats a document as a tree of nodes, since an XML document is essentially a tree structure. Parts of a document are addressed by a so-called XPath location path which has a textual syntax. The location path is usually applied to the root of the XML document, and the result of evaluating the location path for that document is the node-set consisting of (possibly, multiple) nodes selected by the location path. These selected nodes correspond to XML elements, attributes, character data and other parts of the initial XML document.

Figure 2 gives an example of an XPath location path. A location path consists of a sequence of one or more location steps separated by /. For example, the location path in figure 2 consists of 4 location steps.

rdf:RDF/rdf:Description/dc:title/text()

Figure 2: A sample XPath location path

The steps in a location path are evaluated in turn from left to right. The leftmost location step is evaluated first, usually with respect to the node which represents the root of the XML document. Each of the following location steps selects a node-set, which is evaluated with respect to the node-set selected by the previous location step. The node-set selected by the rightmost step is the result of the whole location path for that XML document.

Each location step has three parts:

An axis, which specifies the tree relationship between the nodes given to the location step and the nodes selected by the location step. An axis can be thought of as a "direction of movement" within the tree which represents the XML document. The XPath specification defines 13 axes. These include axes for descending to the leaves of the tree, for ascending in the direction of the root, for selecting sibling nodes and so on. Syntactically the axis name is separated from the rest of the location step by "::" (a double colon).

A node test, which specifies the type and possibly the name of the nodes selected by the location step. While the axis specifies the "direction of movement", the node test specifies the desired nodes to be selected.

Zero or more predicates. Each predicate is syntactically written in square brackets and is used to further refine the set of nodes selected by the location step.

For the sake of brevity, we won't discuss XPath predicates in this paper, referring the reader to [4].

The most important XPath axis is the child axis. The child axis contains the children of the context node. For a node representing an XML element, the child axis selects both its nested XML elements and its nested text nodes. For the root of the document, the child axis selects the document element. The child axis is the default XPath axis, i.e. the word child:: may be omitted in an XPath location step. In the example in figure 2, every location step uses the child axis.

Two XPath node tests are considered in this paper.

The node test text() accepts text nodes only. In other words, in the terminology of the XPath Specification [1], the node test text() is true for every text node. In the example in figure 2, the last (rightmost) location step uses this node test.

A node test that is an XML (qualified) name is true for a node with the same name. For an XML element node, the node's name is the name of the XML element. In the example in figure 2, the first three location steps use node tests that are XML qualified names (respectively, rdf:RDF, rdf:Description and dc:title). The mechanism of namespace prefix expansion in XPath is discussed in more detail in the next subsection.

If applied to the root of the XML document from figure 1, the location path from figure 2 will evaluate in the following way:

The first location step (evaluated with respect to the root of the document) selects the node-set consisting of just the document element rdf:RDF (it satisfies the node test);

The second location step selects all rdf:Description children of the document element (there is exactly one such element in the sample document);

Similarly, the third location step selects all dc:title children of the previously selected rdf:Description element (there is exactly one such dc:title element in the sample document);

Finally, the fourth location step selects the child text nodes of the previously selected dc:title element. This evaluates to a node-set consisting of the single text node "Algebra", and this becomes the result of evaluating the whole location path for the sample document.

Thus, the location path from figure 2 addresses the title of the book described in the XML document from figure 1.

Elements and attributes in an XML document generally have their names qualified [5] with a Uniform Resource Identifier (URI). As discussed in our previous article [3], the role of the URI in a name is purely to allow applications to recognize the name. The XML Namespaces Recommendation qualifies names with URIs indirectly, through prefixes. If an element type name or attribute name contains a colon, the part of the name before the colon is the prefix, and the part after the colon is the local name. A prefix foo refers to the URI specified in the value of the xmlns:foo attribute [6]. Thus, the name of a node is modeled as a pair consisting of a local part and a (possibly null) namespace URI; this is called an expanded-name.

To select elements and attributes which have qualified names, XPath uses the node test which is itself a qualified name, i.e. it syntactically consists of the local name and the prefix separated by a colon. A qualified name in the node test is expanded into an expanded-name using the namespace declarations from the context of the evaluated XPath expression. The namespace declarations consist of a mapping from prefixes to namespace URIs. A node test that is a qualified name is true only if the node has an expanded-name equal to the expanded-name specified by this qualified name, i.e. they have equal namespace URIs and equal local names.

The XPath processor receives the namespace declarations from the XPath user, in particular from XSLT or XPointer. The XSLT and XPointer Recommendations specify how the namespace declarations are determined for XPath expressions used in XSLT and XPointer respectively.

It should be noted that the XPath namespace environment and the document namespace environment are totally independent. In an XML document, a prefix foo refers to the URI specified in the value of the nearest xmlns:foo attribute, and the prefix name is chosen by the document designer. In an XPath location path, a prefix refers to the namespace declaration, and the prefix name is chosen by the application developer. These two kinds of prefixes have nothing to do with each other. Instead, the corresponding namespace URIs are compared when the XPath node test is evaluated.

To make our XPath example from figure 2 accurate, we now should mention the namespace declaration: the rdf: prefix is associated with the RDF namespace "http://www.w3.org/1999/02/22-rdf-syntax-ns#", and the dc: prefix is associated with the Dublin Core namespace "http://purl.org/dc/elements/1.1/".
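The independence of the two prefix environments can be made concrete with a small sketch in Scheme, the implementation language used later in this article. The helper expand-qname and both declaration lists are hypothetical names introduced only for illustration, and the error procedure is assumed to be available (it is in R7RS and most implementations). The sketch expands a qualified name into an expanded-name, here a pair of namespace URI and local name, and shows that two different prefixes bound to the same URI yield equal expanded-names.

```scheme
;; Expand "prefix:local" into (namespace-URI . local-name) using an
;; association list of (prefix . URI) declarations.
;; An unprefixed name gets a null namespace URI, represented here as #f.
(define (expand-qname qname decls)
  (let loop ((i 0))
    (cond ((= i (string-length qname))
           (cons #f qname))
          ((char=? (string-ref qname i) #\:)
           (let ((binding (assoc (substring qname 0 i) decls)))
             (if binding
                 (cons (cdr binding)
                       (substring qname (+ i 1) (string-length qname)))
                 (error "undeclared prefix in" qname))))
          (else (loop (+ i 1))))))

;; Prefix chosen by the document designer (hypothetical declaration list).
(define doc-decls
  '(("dc" . "http://purl.org/dc/elements/1.1/")))

;; Prefix chosen by the application developer for the location path.
(define xpath-decls
  '(("dublin" . "http://purl.org/dc/elements/1.1/")))

;; The prefixes differ, but the expanded-names are equal.
(display (equal? (expand-qname "dc:title" doc-decls)
                 (expand-qname "dublin:title" xpath-decls)))   ; displays #t
(newline)
```

Only the namespace URI and the local name take part in the comparison, which is why the choice of prefix on either side does not matter.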

In this section we will consider SXPath [7], which is an XPath implementation in the functional programming language Scheme [8].

First we would like to note that the textual representation of XML with its familiar angle brackets is just an external representation of an XML document [9]. Applications (and an XPath processor in particular) ought to deal with its internalized form, which allows an application to locate specific data or transform it.

For example, XPath implementations may use the Document Object Model (DOM) as an application programming interface for managing XML. As discussed in [10], the large difference between the external textual XML notation and this internal object model leads to the well-known problem of impedance mismatch [11].

An XPath implementation in Scheme allows us to use a very natural and convenient internal representation for XML data: SXML [9]. SXML is a representation of an XML document in the form of an S-expression. As discussed in [10], the XML and SXML textual notations are much alike: roughly speaking, SXML simply uses a parenthesis in place of every pair of angle brackets. This makes the SXML notation easy to learn, concise, easy to read, and easy to edit even by hand. SXML can thus be considered both the external representation of the document and its internalized form at the same time. To illustrate the idea, figure 3 shows the SXML analog of the XML document from figure 1.

Figure 3: The SXML document which corresponds to the XML document from figure 1

An XML document can be converted automatically into the corresponding SXML form with SSAX [12], a functional XML parsing framework for Scheme.
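To give a feel for the notation, the following is a minimal sketch of what such an SXML document might look like. It is a reduced, hypothetical document: only the rdf:RDF, rdf:Description and dc:title elements and the title text "Algebra" come from the discussion above, while the rdf:about attribute value and the identifier sample-doc are assumptions made for illustration.

```scheme
;; XML element:    <dc:title>Algebra</dc:title>
;; SXML element:   (dc:title "Algebra")
;;
;; XML attribute:  <rdf:Description rdf:about="...">
;; SXML attribute: (rdf:Description (@ (rdf:about "...")) ...)
;;
;; *TOP* stands for the root of the document; its child is the
;; document element rdf:RDF.
(define sample-doc
  '(*TOP*
    (rdf:RDF
     (rdf:Description
      (@ (rdf:about "http://example.org/algebra"))   ; hypothetical URI
      (dc:title "Algebra")))))

;; The document is ordinary Scheme data: car gives the node name,
;; cdr gives the node content.
(car sample-doc)          ; => *TOP*
(car (cadr sample-doc))   ; => rdf:RDF
```

Because the whole document is a nested list, the standard list functions of Scheme are enough to inspect it.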

SXPath fully conforms to the XPath specification of the W3 Consortium. Figure 4 illustrates how the location path from figure 2 can be evaluated with SXPath for the SXML document from figure 3. Here we suppose that this document is bound to the Scheme identifier doc.

((sxpath "rdf:RDF/rdf:Description/dc:title/text()")
 doc)

Figure 4: Evaluating the location path from figure 2 with SXPath (here we suppose that the identifier doc is bound to the SXML document from figure 3)

SXPath naturally models a node-set selected by a location path as a Scheme list whose members are SXML nodes. Thus, the example in figure 4 returns a list consisting of the single text node "Algebra".

Evaluation with SXPath proceeds in two stages:

The XPath location path is parsed. More precisely, SXPath is able to parse location paths with "functional" location steps in addition to standard XPath. The result of this parsing is a function which corresponds to the given location path. In figure 4, this corresponds to the inner Scheme expression:

(sxpath "rdf:RDF/rdf:Description/dc:title/text()")

The constructed function is applied to the SXML document. Given a document, the function returns a node-set as the result of evaluating the location path for this document. In figure 4, this corresponds to the outer Scheme expression:

((...) doc)

This approach has the following advantages:

A function created by SXPath may be applied repeatedly to different XML documents. For instance, we can bind the constructed function to a Scheme identifier for future reference:

(define f (sxpath "rdf:RDF/rdf:Description/dc:title/text()"))

and then use this function several times:

(f sxml-doc1)
(f sxml-doc2)

For example, this is useful when we evaluate a single location path for several documents, or perform XSLT pattern matching [13]. Note that the location path is parsed just once.

A similar strategy can be used to place location path parsing in a non-critical part of a program (say, initialization). In this case, the time-critical parts of the program just apply the already constructed functions.

As discussed in our previous article [3], SXML employs the concept of namespace-ids, which is quite similar to XML namespace prefixes. Like a prefix, a namespace-id stands for a namespace URI. The distinctive feature of a namespace-id is that there is a 1-to-1 correspondence between namespace-ids and the corresponding namespace URIs. This is generally not true for XML namespace prefixes and namespace URIs.

A namespace-id can be thought of as just a shortcut for a namespace URI in SXML names. It is important to note that namespace-ids are chosen by the application developer, because they are introduced when an XML document is converted to SXML. Figure 5 illustrates this idea with a function call to the SSAX parser [12]. The second argument of the function call specifies the desired shortcuts for some particular namespace URIs. When parsing an XML document, SSAX automatically resolves XML prefixes to the corresponding URIs, and then replaces some of them with namespace-ids, in accordance with the specified shortcuts.

Since both prefix names in an XPath location path and namespace-ids in an SXML document are chosen by the application developer, SXPath assumes by default that a prefix in a location path denotes a namespace-id. This SXPath behavior is natural, because a namespace-id is a shortcut for a namespace URI in SXML names. Formally, if no namespace declarations are provided to the sxpath function, each prefix appearing in a location path is treated as the namespace-id of the same name; i.e. no prefix mapping is performed.

SXPath can accept namespace declarations in an optional second argument. This is illustrated by figure 6. The namespace declarations, given as a list of pairs (prefix . namespace-URI), specify the mapping from prefixes in the XPath location path to namespace URIs. This mapping is necessary for evaluating an XPath location path for an SXML document with directly qualified names [3], like the one shown in figure 7.

Not only is the SXPath interface convenient; the implementation of XPath with functional methods is itself very natural and straightforward. In this section we illustrate how basic XPath concepts are realized as a set of low-level functions in the SXPath library.

As mentioned in section 3, SXPath models XPath node-sets as Scheme lists whose members are SXML nodes. Although SXML nodes are often lists themselves, we can always distinguish a node-set from a node: if a node is a list it always has a Scheme symbol (the name of the node) as its first member. On the contrary, a non-empty node-set obviously has its first element being an SXML node, which can never be a Scheme symbol.

The nodeset? predicate provided by the SXPath library implements exactly this logic: it returns true either for a null list (an empty node-set) or for a list whose first member is not a Scheme symbol. The implementation of the nodeset? predicate is shown in figure 8; car is a basic Scheme function for taking the first member of a list.
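The logic just described can be sketched in a few lines. This is a minimal version written from the description above, so its exact form is an assumption and may differ from the library source.

```scheme
;; A node-set is either the empty list or a list whose first member
;; is not a symbol (a node that is a list always starts with its name).
(define (nodeset? x)
  (or (null? x)
      (and (pair? x)
           (not (symbol? (car x))))))

(nodeset? '())                        ; => #t, the empty node-set
(nodeset? '((dc:title "Algebra")))    ; => #t, a node-set with one element node
(nodeset? '("Algebra"))               ; => #t, a node-set with one text node
(nodeset? '(dc:title "Algebra"))      ; => #f, a single element node
(nodeset? "Algebra")                  ; => #f, a single text node
```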

As discussed in section 2, an XPath node test is used to specify the condition on nodes selected by the location step. Since the XPath Recommendation defines node test in terms of being true for some kinds of nodes, it is natural to implement a node test as a predicate which takes an SXML node and returns a boolean value depending on whether the node satisfies the node test.

Figure 9 gives the simplified implementation for two XPath node tests considered in this paper:

Function text? implements the node test text(), which accepts text nodes only. In SXML, the text node is represented as a Scheme string, so the standard Scheme predicate string? performs this task.

Function ntype?? implements the node test that is true for a node with a given name. This function is first parameterized with the name, and then returns a predicate applicable to nodes:

(lambda (node)
  (and (list? node) ; a text node doesn't have a name
       (eq? name (car node))))

The implementation of this predicate is straightforward: a node must be a list (because text nodes don't have names at all) and its first member (obtained by car) must be equal to the name parameter.
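Put together as complete definitions, the two node tests might look as follows. This is a sketch of the simplified versions discussed here, not the full library code; it uses pair? in place of list? as a small defensive choice, so that an empty list is rejected without an error.

```scheme
;; text() node test: in SXML a text node is a Scheme string.
(define text? string?)

;; Name node test: parameterized by a name, returns a predicate on nodes.
(define (ntype?? name)
  (lambda (node)
    (and (pair? node)              ; a text node doesn't have a name
         (eq? name (car node)))))

(define title? (ntype?? 'dc:title))

(title? '(dc:title "Algebra"))         ; => #t
(title? '(rdf:Description "Algebra"))  ; => #f, different name
(title? "Algebra")                     ; => #f, text nodes have no name
(text? "Algebra")                      ; => #t
(text? '(dc:title "Algebra"))          ; => #f
```

Parameterizing ntype?? with the name first means the same predicate can be built once and reused across many nodes.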

The suitability of SXML as a data model for XPath is illustrated by the implementation of the XPath child axis shown in figure 10.

Since the axis and the node test work in conjunction, SXPath designs axes as parameterized by node test predicates. When supplied with a node-test predicate, select-kids produces a function which takes an SXML node and returns its children:

(lambda (node)
  (if (text? node)
      '() ; no children
      (filter node-test (cdr node))))

A text node obviously has no children, so the empty node-set is returned in this case. Otherwise (the node is a list), the node's children (selected by the basic Scheme function cdr) are filtered in accordance with the node-test predicate. Function filter preserves only those child nodes for which the node-test predicate is true.
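A self-contained sketch of this axis, following the simplified form shown above, is given below. The definition of filter is included only because it is not part of the core language in every Scheme implementation; where it is already provided (for example by SRFI 1), that definition can be dropped. The sample node and its dc:creator child are hypothetical.

```scheme
;; Keep the members of lst for which pred is true.
(define (filter pred lst)
  (cond ((null? lst) '())
        ((pred (car lst)) (cons (car lst) (filter pred (cdr lst))))
        (else (filter pred (cdr lst)))))

;; Child axis, parameterized by a node-test predicate.
(define (select-kids node-test)
  (lambda (node)
    (if (string? node)
        '()                              ; a text node has no children
        (filter node-test (cdr node)))))

;; Hypothetical element node with two child elements.
(define description
  '(rdf:Description
    (dc:title "Algebra")
    (dc:creator "hypothetical author")))

(define (title? node)
  (and (pair? node) (eq? 'dc:title (car node))))

((select-kids title?) description)   ; => ((dc:title "Algebra"))
((select-kids string?) "Algebra")    ; => ()
```

Note that the result is always a list, so an empty selection and a selection of one node are handled uniformly as node-sets.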

Since SXML represents elements and attributes in a uniform way [10], a specific implementation of the attribute axis is not required: in SXPath it is just a particular case of the child axis.

The SXPath library includes several combinator functions. In particular, the node-join combinator joins several functions representing location steps into a complete location path. The implementation of node-join is routine and is not given here. Instead, in figure 11 we consider one particular combination of the low-level SXPath primitives already discussed. This combination includes four location steps, and it implements the XPath location path from figure 2. The apostrophe is used to include literal constants in Scheme code.

Evaluating the expression

(sxpath "rdf:RDF/rdf:Description/dc:title/text()")

will produce a function very similar to the one shown in figure 11. Low-level SXPath functions thus constitute a virtual machine into which higher-level expressions are compiled [14].
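How four location steps combine into one function can be sketched with the simplified primitives from this section. The node-join below is an assumed, reduced version written for this illustration (it always starts from a single node and applies each step to every node of the current node-set); the library's own combinator is more general. The identifiers title-text and doc-sketch, and the reduced document, are likewise hypothetical.

```scheme
(define (filter pred lst)
  (cond ((null? lst) '())
        ((pred (car lst)) (cons (car lst) (filter pred (cdr lst))))
        (else (filter pred (cdr lst)))))

(define text? string?)

(define (ntype?? name)
  (lambda (node)
    (and (pair? node) (eq? name (car node)))))

(define (select-kids node-test)
  (lambda (node)
    (if (text? node)
        '()
        (filter node-test (cdr node)))))

;; Assumed sketch of node-join: evaluate the steps left to right,
;; applying each step to every node selected by the previous step
;; and appending the resulting node-sets.
(define (node-join . steps)
  (lambda (node)
    (let loop ((nodeset (list node)) (steps steps))
      (if (null? steps)
          nodeset
          (loop (apply append (map (car steps) nodeset))
                (cdr steps))))))

;; rdf:RDF/rdf:Description/dc:title/text() as four child-axis steps.
(define title-text
  (node-join (select-kids (ntype?? 'rdf:RDF))
             (select-kids (ntype?? 'rdf:Description))
             (select-kids (ntype?? 'dc:title))
             (select-kids text?)))

;; Reduced hypothetical document.
(define doc-sketch
  '(*TOP* (rdf:RDF (rdf:Description (dc:title "Algebra")))))

(title-text doc-sketch)   ; => ("Algebra")
```

Each step maps a node to a node-set, so joining steps amounts to mapping and appending, which mirrors the left-to-right evaluation of location steps described earlier.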

In addition to the "textual" W3C-compatible syntax for an XPath location path, the SXPath library also introduces its own "native" representation for a location path -- in the form of a list. In this representation, location steps are written as members of the list. The idea is illustrated by figure 12, and its inner expression

(sxpath '(rdf:RDF rdf:Description dc:title *text*))

will also produce a function very similar to the one in figure 11.

((sxpath '(rdf:RDF rdf:Description dc:title *text*))
 doc)

Figure 12: Evaluating the "native" SXPath analog for the location path from figure 2 (here we suppose that the identifier doc is bound to the SXML document from figure 3)

With this "native" SXPath representation for a location path, SXPath provides a natural ability for a Scheme function to serve as a location step in a location path, since a Scheme function is the low-level representation of a location step in SXPath. To illustrate this statement, we rewrite our example from figure 12 by representing the third location step in its explicit form as an SXPath low-level converter (figure 13). Note that the examples in figures 4, 12 and 13 have equivalent semantics. Although the example in figure 13 may seem to involve unnecessary complexity, it illustrates a very important feature of SXPath: the ability to use an arbitrary Scheme function as a location step in a location path provides SXPath with virtually unlimited capabilities. In particular, this feature makes SXPath a query language, as discussed in the next subsection.

Figure 13: Evaluating the "native" SXPath representation for a location path from figure 2, with the third location step in the form of a Scheme function (here we suppose that the identifier doc is bound to the SXML document from figure 3)

XPath is defined by the W3 Consortium as a language for addressing parts of an XML document. This means, in particular, that an XPath location path cannot produce anything other than a set of nodes from the document with respect to which this location path is evaluated.

SXPath, however, provides features of a query language, because SXPath allows one to formulate arbitrary requests for information from XML/SXML documents and to generate arbitrary reports from them.

The SXPath "native" representation for a location path is an S-expression and may include arbitrary procedures defined in Scheme in addition to a set of predefined XPath primitives.

To illustrate this feature, we modify our example location path which selects the title of the book. In accordance with Dublin Core semantics we can suppose that a book description always has exactly one title (i.e. exactly one dc:title element), so we may wish the location path to return just this title, not the node-set. Then, suppose that we want this title in the form of an uppercase string. We perform these actions in a fifth, custom step of the location path (figure 14). When evaluated, this location path will produce the string "ALGEBRA". Note that there was no "ALGEBRA" text node in the SXML document; this string was constructed as the result of the SXPath query.

Figure 14: A query in SXPath (here we suppose that the identifier doc is bound to the SXML document from figure 3)
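The idea of a custom final step can be tried out without the library by means of a small stand-in. In the sketch below, mini-sxpath is a hypothetical miniature of a list-form path evaluator written only for this illustration, and it is not the SXPath API: a symbol denotes a child step with a name test, *text* denotes a child step with the text() test, and a procedure is a custom step applied to the whole node-set selected so far. The helper upcase, the identifiers title-query and doc-sketch, and the reduced document are assumptions as well.

```scheme
(define (filter pred lst)
  (cond ((null? lst) '())
        ((pred (car lst)) (cons (car lst) (filter pred (cdr lst))))
        (else (filter pred (cdr lst)))))

(define (ntype?? name)
  (lambda (node)
    (and (pair? node) (eq? name (car node)))))

(define (select-kids node-test)
  (lambda (node)
    (if (string? node)
        '()
        (filter node-test (cdr node)))))

;; Hypothetical miniature evaluator for list-form location paths.
(define (mini-sxpath path)
  (lambda (node)
    (let loop ((result (list node)) (path path))
      (cond ((null? path) result)
            ((procedure? (car path))          ; custom step: gets the node-set
             (loop ((car path) result) (cdr path)))
            (else                             ; child step with a node test
             (let ((step (select-kids
                          (if (eq? (car path) '*text*)
                              string?
                              (ntype?? (car path))))))
               (loop (apply append (map step result))
                     (cdr path))))))))

;; Uppercase a string using only basic list and character functions.
(define (upcase str)
  (list->string (map char-upcase (string->list str))))

;; Reduced hypothetical document.
(define doc-sketch
  '(*TOP* (rdf:RDF (rdf:Description (dc:title "Algebra")))))

;; Four standard steps followed by a fifth, custom step that takes the
;; single title out of the node-set and converts it to uppercase.
(define title-query
  (mini-sxpath
   (list 'rdf:RDF 'rdf:Description 'dc:title '*text*
         (lambda (nodeset) (upcase (car nodeset))))))

(title-query doc-sketch)   ; => "ALGEBRA"
```

The custom step assumes a non-empty node-set, which matches the supposition that a description has exactly one title; a more careful query would check for an empty result before calling car.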

The ability to use Scheme functions as SXPath predicates, and SXPath selectors in Scheme functions makes SXPath a truly extensible language [14]. A user can compose SXML queries following the XPath Recommendation -- and at the same time rely on the full power of Scheme for custom selectors.

XPath is the key language in the stack of XML technologies, and is used by many important XML languages, in particular XSLT, XPointer and XQuery.

With its support for XML Namespaces, XPath can be used for selecting information from descriptions expressed in Resource Description Framework, an important technology for electronic libraries.

SXML was designed with the goal of effective evaluation of XPath expressions in mind. SXPath, a compliant XPath implementation in the functional programming language Scheme, extends XPath with features of a query language and seamlessly integrates it with Scheme. The SXPath/Scheme combination provides a powerful technology for XML-based digital libraries, and is especially suitable for the implementation of light-weight digital libraries.

About Authors

Dmitry Lizorkin - a Ph.D. student at Moscow State University. His M.Sc. thesis, defended in 2002, was dedicated to the implementation of the XML Linking Language (XLink) using functional methods.
e-mail: lizorkin@hotbox.ru

Kirill Lisovskiy - PhD, IT Consultant and Senior Researcher at the Institute for System Programming of the Russian Academy of Sciences. His primary research interest is functional and logic techniques for semistructured data management. Since 1999 he has participated in a number of research and development projects related to the implementation and application of XML data management techniques based on the Scheme programming language.
e-mail: lisovsky@acm.org http://pair.com/lisovsky