Hedge automata: a formal model for XML schemata

MURATA Makoto (FAMILY Given)

Fuji Xerox Information Systems

Introduction

This note presents the preliminaries of hedge automaton theory. In the
XML community, this theory has recently been recognized as a simple
but powerful model for XML schemata. In particular, the design of
RELAX (REgular LAnguage for XML) is directly based on this
theory.

Hedges

First, we introduce hedges. Informally, a hedge is a sequence
of trees. In XML terminology, a hedge is a sequence of elements
possibly interspersed with character data (or types of character data); in
particular, an XML document is a hedge.

A hedge over a finite set Σ (of symbols) and a finite set
X (of variables) is:

ε (the null hedge),

x, where x is a variable in X,

a<u>, where a is a symbol in Σ and u is a hedge (the addition of a symbol as the root node), or

uv, where u and v are hedges (the concatenation of two hedges).

Figure 1 depicts three hedges: a<ε>, a<x>, and a<ε> b<b<ε> x>.
Observe that elements of Σ (i.e., a and b) are used as
labels of non-leaf nodes, while elements of X (i.e., x) are used
as labels of leaf nodes. We abbreviate a<ε> as a.
Thus, the third example is denoted by a b<b x>.

Figure 1. Three hedges: a<ε>, a<x>, and a<ε> b<b<ε> x>
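
The inductive definition above maps directly onto a small data structure. The following Python sketch is only an illustration: the class names `Var` and `Node` and the function `show` are assumptions introduced here, not part of the formal development. It builds the three hedges of Figure 1 and prints them in the a<u> notation, abbreviating a<ε> as a.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str                            # a variable in X, e.g. "x"


@dataclass(frozen=True)
class Node:
    label: str                           # a symbol in Sigma, e.g. "a"
    children: Tuple["Item", ...] = ()    # the hedge u in a<u>


Item = Union[Var, Node]
Hedge = Tuple[Item, ...]                 # the empty tuple is the null hedge


def show(hedge: Hedge) -> str:
    """Render a hedge in the a<u> notation, abbreviating a<ε> as a."""
    parts = []
    for item in hedge:
        if isinstance(item, Var):
            parts.append(item.name)
        elif item.children:
            parts.append(f"{item.label}<{show(item.children)}>")
        else:
            parts.append(item.label)
    return " ".join(parts) if parts else "ε"


# The three hedges of Figure 1.
first = (Node("a"),)
second = (Node("a", (Var("x"),)),)
third = (Node("a"), Node("b", (Node("b"), Var("x"))))

print(show(first))    # a
print(show(second))   # a<x>
print(show(third))    # a b<b x>
```

Concatenation of two hedges is then just tuple concatenation, for example `first + second`.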

Next, we consider an XML document.
Suppose that Σ = {doc, title, image, para} and X = {#PCDATA}. Then,

doc<title<#PCDATA> para<#PCDATA> image<ε> para<#PCDATA>>

is a hedge. In the XML syntax, this hedge can be represented
as below:
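
As a rough illustration of this correspondence, the following Python sketch serializes the hedge above as XML. The nested-list encoding and the function name `to_xml` are assumptions made for this example, and each #PCDATA leaf is emitted as a placeholder string, since the hedge records only the type of the character data and not its content.

```python
from xml.sax.saxutils import escape

# Assumed encoding: a hedge is a list of items; an item is either a
# variable name (str) or a pair (symbol, hedge) standing for symbol<hedge>.
doc = [("doc", [
    ("title", ["#PCDATA"]),
    ("para", ["#PCDATA"]),
    ("image", []),
    ("para", ["#PCDATA"]),
])]


def to_xml(hedge, text="..."):
    """Serialize a hedge as XML; each variable leaf becomes `text`."""
    out = []
    for item in hedge:
        if isinstance(item, str):            # a variable such as #PCDATA
            out.append(escape(text))
        else:
            symbol, children = item
            if children:
                out.append(f"<{symbol}>{to_xml(children, text)}</{symbol}>")
            else:
                out.append(f"<{symbol}/>")   # symbol<ε> is an empty element
    return "".join(out)


print(to_xml(doc))
# <doc><title>...</title><para>...</para><image/><para>...</para></doc>
```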

Both the rule for non-terminal n1 and that for n2 have segment on the
right-hand side. However, the former has content model np*n2*, and
the latter has content model np*. This implies that top-level
segments can have subordinate segments, but these subordinate
segments cannot have subordinate segments.

The DTD syntax cannot exactly capture this RHG, since every
occurrence of segments is forced to have the same content model.
The smallest DTD that covers this RHG is as below:

<!ELEMENT segment (para*, segment*)>
<!ELEMENT para (#PCDATA)>

Observe that this DTD allows unlimited nesting of segments. Since
the DTD syntax does not allow two content models for segments,
this DTD uses one loose content model.

Hedge Automaton

In this section, we introduce deterministic hedge automata and
non-deterministic hedge automata.

A deterministic hedge automaton (DHA) is
< Σ, X, Q, α, ι, F>, where:

Σ is a finite set of symbols,

X is a finite set of variables,

Q is a finite set of states,

α is a function from Σ × Q* to Q such that for every q in Q and
x in Σ, {q1q2...qk | k ≥ 0, α(x, q1q2...qk) = q} is a
regular set,

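A non-deterministic hedge automaton (NDHA) is <Σ, X, Q, α, ι, F>
with the same components, except that:

α is a relation (called the transition relation) from
Σ × Q* to Q (or, equivalently, a function from Σ × Q* to 2^Q)
such that for every q in Q and x in Σ,
{q1q2...qk | k ≥ 0, α(x, q1q2...qk, q)} is a regular string
language, and

ι is a relation from X to Q (or, equivalently, a function from X to 2^Q).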

By definition, a DHA is also an NDHA: we only have to identify a state
with the singleton set containing that state. Thus, the above DHA is also
an example of an NDHA.

The last example RHG (the one for segments, discussed above) can be
readily captured by an NDHA <Σ, X, Q, α, ι, F>, where

Σ = {segment, para},

X = {#PCDATA},

Q = {q1, q2, qp, q#},

α(a, u) contains q1 when a = segment and u is in L(qp*q2*),

α(a, u) contains q2 when a = segment and u is in L(qp*),

α(a, u) contains qp when a = para and u is in L(q#),

ι(x) = q# when x = #PCDATA, and

F = q1.
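
To make the behaviour of this NDHA concrete, here is a minimal Python sketch that evaluates it bottom-up. Several choices are assumptions made for illustration only: hedges are encoded as nested lists, each state is encoded as a single character so that the content models become ordinary `re` patterns, F = q1 is read as the language containing only the one-state sequence q1, and non-determinism is handled by brute-force enumeration rather than by an efficient algorithm.

```python
import re
from itertools import product

# States as single characters: q1 -> "1", q2 -> "2", qp -> "p", q# -> "#".
ALPHA = [                       # (symbol, content model, resulting state)
    ("segment", r"p*2*", "1"),
    ("segment", r"p*", "2"),
    ("para", r"#", "p"),
]
IOTA = {"#PCDATA": {"#"}}
FINAL = r"1"                    # F = q1, read as a regular expression


def states_of(item):
    """Return the set of states the NDHA can assign to one item."""
    if isinstance(item, str):                   # a variable
        return IOTA.get(item, set())
    symbol, children = item
    child_sets = [states_of(child) for child in children]
    result = set()
    # Try every choice of one state per child (exponential, but simple).
    for choice in product(*child_sets):
        word = "".join(choice)
        for sym, model, state in ALPHA:
            if sym == symbol and re.fullmatch(model, word):
                result.add(state)
    return result


def accepts(hedge):
    """Accept if some choice of top-level states spells a word in F."""
    sets = [states_of(item) for item in hedge]
    return any(re.fullmatch(FINAL, "".join(w)) for w in product(*sets))


para = ("para", ["#PCDATA"])
nested = [("segment", [para, ("segment", [para])])]
too_deep = [("segment", [("segment", [("segment", [])])])]

print(accepts(nested))     # True: one level of subordinate segments
print(accepts(too_deep))   # False: a subordinate segment has its own
```

The second hedge would be valid against the loose DTD given earlier, which is exactly the difference between the DTD and the RHG.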

Properties of Regular Hedge Languages

Equivalence

The following conditions are equivalent.

(1) L is generated by an RHG,

(2) L is accepted by a DHA, and

(3) L is accepted by an NDHA.

The proof that (3) implies (2) is done by the subset construction.
The rest of the proof is straightforward.

Boolean closure

Suppose that sets L1 and L2 are accepted by (N)DHAs
M1 and M2, respectively. We can effectively construct (N)DHAs
that accept the following languages:

the intersection of L1 and L2,

the union of L1 and L2, and

the complement of L1 (the set of all hedges not contained in L1).
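
For deterministic automata, these constructions can be sketched with the usual product and complement ideas. The Python code below is an illustration under stated assumptions, not a definitive construction: the class `DHA` and the helpers `product_dha` and `complement` are hypothetical names, α and F are given as Python callables (so regularity is assumed rather than checked), both automata share the same X, and α is total. An NDHA would first have to be determinized before its final state sequence set can be complemented in this way.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Tuple

State = Hashable


@dataclass
class DHA:
    alpha: Callable[[str, Tuple[State, ...]], State]   # total transition function
    iota: Dict[str, State]                             # variables -> states
    final: Callable[[Tuple[State, ...]], bool]         # membership test for F

    def run(self, hedge):
        """Assign one state to each top-level item, bottom-up."""
        return tuple(
            self.iota[item] if isinstance(item, str)
            else self.alpha(item[0], self.run(item[1]))
            for item in hedge
        )

    def accepts(self, hedge):
        return self.final(self.run(hedge))


def product_dha(m1, m2, combine):
    """Run m1 and m2 in parallel; states of the result are pairs."""
    def unzip(pairs):
        return tuple(p for p, _ in pairs), tuple(q for _, q in pairs)

    def alpha(symbol, pairs):
        ps, qs = unzip(pairs)
        return (m1.alpha(symbol, ps), m2.alpha(symbol, qs))

    def final(pairs):
        ps, qs = unzip(pairs)
        return combine(m1.final(ps), m2.final(qs))

    iota = {x: (m1.iota[x], m2.iota[x]) for x in m1.iota}
    return DHA(alpha, iota, final)


def complement(m):
    """Same transitions, complemented final state sequence set."""
    return DHA(m.alpha, m.iota, lambda states: not m.final(states))


# Two toy automata (assumed examples): M1 accepts hedges with an even
# number of top-level items; M2 accepts hedges whose nodes are all "a".
M1 = DHA(lambda a, qs: "n", {"x": "n"}, lambda qs: len(qs) % 2 == 0)
M2 = DHA(
    lambda a, qs: "good" if a == "a" and all(q == "good" for q in qs) else "bad",
    {"x": "good"},
    lambda qs: all(q == "good" for q in qs),
)
both = product_dha(M1, M2, lambda p, q: p and q)     # intersection
either = product_dha(M1, M2, lambda p, q: p or q)    # union

hedge = [("a", []), ("b", ["x"])]                    # a b<x>
print(both.accepts(hedge), either.accepts(hedge), complement(M2).accepts(hedge))
```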

Parse trees of extended context-free grammars

The set of parse trees of an extended context-free grammar is said to
be a local tree language. A lot is known about the
relationships between local tree languages and regular hedge
languages. We mention two observations which are directly relevant
to XML.

(1) A local tree language is a regular hedge language (in other
words, for any extended context-free grammar, we can construct a
DHA), and

(2) for any regular hedge language that contains trees only, there
exists a unique minimal local tree language that includes that regular
hedge language.

Observation (1) implies that RHGs are at least as expressive as DTDs
(the segment example above shows that they are strictly more expressive),
while (2) ensures that, given any RHG, we can construct a reasonable DTD.

BIBLIOGRAPHIC NOTES

In the theoretical computer science community, regular hedge languages
were first studied by Pair et al. [PQ68] and Takahashi [Tak75]. Regular
hedge languages can also be considered as
extensions of regular tree languages [Tha67]. We borrow
some concepts from these papers but adopt definitions more similar to
those for regular string languages.

We define RHGs similarly to [PQ68,Tak75], but we
avoid projections. Alternatively, our definition can be considered as
a hedge version of Brainerd's tree regular grammars (called "tree
generating regular systems") [Bra69].

Our definitions of NDHAs and DHAs are derived from (non-)deterministic
tree automata of [Tha67] except that we have extended them
to hedges.

It was Kil-Ho Shin (Fuji Xerox) who first proposed to use regular
hedge languages as a formal model for schemata of structured
documents. His proposal dates back to November 1991, but he never
published any papers. In search of a
formalism for document schemata, HIYAMA Masayuki (FAMILY Given)
reached a similar formalism in 1996.
Since 1993, the present author has applied
regular hedge languages (and hedge monoids, which are outside the
scope of this note) to schema transformation
[Mur97a,Mur97b,Mur98].

The word "hedge" was originally proposed by Bruno Courcelle
[Cou89]. Derick Wood recommended the use of this word, and
it became the standard term in the XML community after a tutorial by
Paul Prescod in 1999. For more information, see the special section on
hedge automata in the
SGML/XML Web Page (http://www.oasis-open.org/cover/topics.html#forestAutomata).