TREX Basics

In this article, we'll explore the TREX
markup language for validating XML documents, focusing on validating a subset of XMLNews-Story Markup Language. Although the
XMLNews-Story markup language has been superseded by the News
Industry Text Format, we use the old version because it's simple, it looks a great
deal like HTML, and it lets us easily show some of TREX's features.

TREX's author, James Clark, says,

"A TREX pattern specifies a pattern for the structure and content of an XML
document. A TREX pattern thus identifies a class of XML documents consisting of those
documents that match the pattern. A TREX pattern is itself an XML document."

A TREX pattern's outer parts

A TREX pattern is enclosed within a <grammar> element. Inside it, a <start>
element holds the pattern that the document must match. The root element of a
news story is the <nitf> element, so that's where we begin our TREX
specification.
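As a rough sketch of that outer shell, the pattern might begin as shown here. The namespace URI is quoted from memory of the TREX specification and should be treated as an assumption, and the <empty/> content is only a placeholder until the sub-elements are described.

```xml
<?xml version="1.0"?>
<!-- Assumption: the TREX namespace URI; check it against your TREX release. -->
<grammar xmlns="http://www.thaiopensource.com/trex">
  <start>
    <!-- The root element of a news story. -->
    <element name="nitf">
      <!-- Placeholder content, replaced once head and body are described. -->
      <empty/>
    </element>
  </start>
</grammar>
```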

The <grammar> and <start> elements are required only
if you want to modularize your pattern by using definitions. (See
below.) Since most non-trivial TREX patterns use definitions, we'll start using them
right away.

Sub-Elements

A news story consists of a <head> element and a <body> element, in that
order. The <head> contains a <title>, which has string data as its
contents. All of these are required elements, and TREX
specifies them as follows:
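A minimal sketch of that structure, assuming the <body> is left empty for the moment (its real content comes later), could look like this:

```xml
<element name="nitf">
  <!-- Child patterns listed in sequence must appear in this order. -->
  <element name="head">
    <element name="title">
      <!-- Any string data is accepted as the title. -->
      <anyString/>
    </element>
  </element>
  <element name="body">
    <!-- Assumption: placeholder until the body content is described. -->
    <empty/>
  </element>
</element>
```

Under those assumptions, a document such as the following (the title text is made up for illustration) should match the pattern:

```xml
<nitf>
  <head>
    <title>Sample Story</title>
  </head>
  <body/>
</nitf>
```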

Multiple Occurrences of Elements

One of TREX's greatest strengths is the ease with which you can specify the number
of occurrences that an element or elements must have. Specifying an
<element> all by itself means that it is required, and that it must
occur exactly once. Specifying multiple occurrences is easy in TREX.

Enclose elements in...    to specify...
<optional>                zero or one occurrence
<zeroOrMore>              zero or more occurrences
<oneOrMore>               one or more occurrences

We need this information to describe the <body> of a news story, as it
starts with optional header information. This header information consists of a
<body.head> element, which contains an optional
<hedline> (yes, it's really spelled that way) and zero or more
<byline> elements. Each of these has sub-elements, as shown below in
the TREX pattern. This entire text would go between the <element
name="body"> and its corresponding </element>.
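The description above can be sketched roughly as follows. The sub-element names <hl1>, <hl2>, and <bytag>, and their string-only content, are assumptions made for illustration; they are not given in the text.

```xml
<!-- The header information is optional within the body. -->
<optional>
  <element name="body.head">
    <!-- At most one hedline. -->
    <optional>
      <element name="hedline">
        <!-- Assumption: one main headline, then any number of subheadlines. -->
        <element name="hl1"><anyString/></element>
        <zeroOrMore>
          <element name="hl2"><anyString/></element>
        </zeroOrMore>
      </element>
    </optional>
    <!-- Any number of bylines, including none. -->
    <zeroOrMore>
      <element name="byline">
        <!-- Assumption: a byline holds a single bytag element. -->
        <element name="bytag"><anyString/></element>
      </element>
    </zeroOrMore>
  </element>
</optional>
```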

Try It Yourself

You can download TREX and see that the document is, indeed, valid. If you're on a
Windows system, you have an executable already available to you. If you're using
Linux or Unix, download trex.jar, xp.jar, and sax2.jar, and use this
shell script:

As a grammar gets more complex, it makes less sense to have it all in one huge block.
TREX lets you <define> a series of elements and then refer to the definitions.
We'll take the information for the body header of a news article and put it after
the <start> element. Here's what the last part of the file now looks
like:
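A hedged sketch of that arrangement is shown here. The definition name body.head.info is hypothetical, and the contents of <hedline> and <byline> are simplified to plain strings to keep the example short.

```xml
<grammar xmlns="http://www.thaiopensource.com/trex">
  <start>
    <element name="nitf">
      <element name="head">
        <element name="title"><anyString/></element>
      </element>
      <element name="body">
        <!-- Refer to the definition instead of writing it inline. -->
        <optional>
          <ref name="body.head.info"/>
        </optional>
      </element>
    </element>
  </start>

  <!-- Hypothetical name; the definition sits after the start element. -->
  <define name="body.head.info">
    <element name="body.head">
      <optional>
        <element name="hedline"><anyString/></element>
      </optional>
      <zeroOrMore>
        <element name="byline"><anyString/></element>
      </zeroOrMore>
    </element>
  </define>
</grammar>
```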

You can make recursive definitions. For example, we can say that a
block_item is zero or more choices of a paragraph, <p>,
the empty image element, <img>, or an unordered list
<ul>. We use TREX's aptly named <choice> specifier
in the pattern below.
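One way to write such a definition is sketched here. The string-only content of <p> and the name unordered.list for the list definition are assumptions for illustration.

```xml
<define name="block_item">
  <zeroOrMore>
    <!-- Each item is exactly one of the alternatives below. -->
    <choice>
      <element name="p">
        <!-- Assumption: a paragraph holds plain string data. -->
        <anyString/>
      </element>
      <element name="img">
        <!-- No children and no attributes yet, so say so explicitly. -->
        <empty/>
      </element>
      <!-- Hypothetical name for the unordered list definition. -->
      <ref name="unordered.list"/>
    </choice>
  </zeroOrMore>
</define>
```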

Note that TREX requires you to explicitly define elements which have neither children
nor attributes as empty. Since we haven't learned about attributes yet, we've specified
the image element as <empty/>.

Now we can define an unordered list, which can itself contain block items; i.e.,
paragraphs, images, and lists within lists.
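A sketch of such a list definition follows. The <li> wrapper element and the definition name unordered.list are assumptions; block_item refers to the definition of block items described earlier.

```xml
<!-- Hypothetical definition name. -->
<define name="unordered.list">
  <element name="ul">
    <oneOrMore>
      <!-- Assumption: each list entry is wrapped in an li element. -->
      <element name="li">
        <!-- The recursion: a list entry may hold paragraphs, images, or nested lists. -->
        <ref name="block_item"/>
      </element>
    </oneOrMore>
  </element>
</define>
```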

XML elements can have attributes, and TREX allows you to specify them in great detail.
A news story, like HTML, can include an <img> tag which has a required
src and optional align, width, and height attributes. The alignment can have
only three possible values, so we specify them explicitly with the <string> element.

The width and height must be positive integers. Since TREX doesn't have any default
type system, the current implementation of TREX reaches out to XML Schema and uses its
type system. That means we need to specify a namespace when we write the pattern for an
image element.
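A possible pattern for the image element is sketched here. Two things are assumptions: the three alignment values (the text does not list them) and the datatype namespace URI, which depends on the version of XML Schema your TREX release supports.

```xml
<!-- Assumption: the datatype namespace URI; check your TREX release. -->
<element name="img"
    xmlns:xsd="http://www.w3.org/2000/10/XMLSchema-datatypes">
  <!-- Required attribute: any string is accepted as the source. -->
  <attribute name="src">
    <anyString/>
  </attribute>
  <optional>
    <attribute name="align">
      <!-- Assumption: illustrative values; only these three would match. -->
      <choice>
        <string>top</string>
        <string>middle</string>
        <string>bottom</string>
      </choice>
    </attribute>
  </optional>
  <optional>
    <attribute name="width">
      <data type="xsd:positiveInteger"/>
    </attribute>
  </optional>
  <optional>
    <attribute name="height">
      <data type="xsd:positiveInteger"/>
    </attribute>
  </optional>
</element>
```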

Notice that <optional> can be used with <attribute>
to specify an optional attribute, just as it is used with <element> to
specify an optional element. This uniform treatment of attributes and elements gives
TREX the power to express complex grammars with a compact vocabulary. (For all the
details, check out the complete TREX pattern that uses attributes and the XML News
Story that uses an image.) In the TREX file, the
xmlns:xsd specification has been placed in the outermost
<grammar> tag so that it's available throughout the file.

Just as it was possible to create a reusable element specification, it's possible
to create a set of attributes that can be reused by many tags. For example, both table
body (<tbody>) and table header (<th>) elements have
identical attributes for determining their horizontal and vertical alignment. This
makes those attributes a perfect candidate for a definition.
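The idea can be sketched as follows. The definition name cell.alignment, the attribute names align and valign, and their permitted values are all assumptions chosen for illustration, and the element content is left out.

```xml
<!-- Hypothetical definition holding the shared alignment attributes. -->
<define name="cell.alignment">
  <optional>
    <attribute name="align">
      <choice>
        <string>left</string>
        <string>center</string>
        <string>right</string>
      </choice>
    </attribute>
  </optional>
  <optional>
    <attribute name="valign">
      <choice>
        <string>top</string>
        <string>middle</string>
        <string>bottom</string>
      </choice>
    </attribute>
  </optional>
</define>

<!-- Both elements pull in the same attributes with a single reference. -->
<element name="tbody">
  <ref name="cell.alignment"/>
  <!-- row content omitted from this sketch -->
</element>

<element name="th">
  <ref name="cell.alignment"/>
  <!-- cell content omitted from this sketch -->
</element>
```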

Now let's get a bit more advanced. Let's say that we want to use this block element
specification for both XML News Story and XHTML verification. The problem is that
news stories have <location> and <copyrite> block elements
and XHTML doesn't; XHTML has a <blockquote> element, but news stories
don't. So, we'll modify our include file as follows.
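A sketch of what such an include file might contain is given here. The definition names extra.block and unordered.list are hypothetical, as is the simplified content of <p> and <img>.

```xml
<grammar xmlns="http://www.thaiopensource.com/trex">
  <define name="block_item">
    <zeroOrMore>
      <choice>
        <element name="p"><anyString/></element>
        <element name="img"><empty/></element>
        <ref name="unordered.list"/>
        <!-- Hook for block elements specific to one markup language. -->
        <ref name="extra.block"/>
      </choice>
    </zeroOrMore>
  </define>

  <define name="unordered.list">
    <element name="ul">
      <oneOrMore>
        <element name="li"><ref name="block_item"/></element>
      </oneOrMore>
    </element>
  </define>

  <!-- Matches nothing until an including pattern overrides it. -->
  <define name="extra.block">
    <notAllowed/>
  </define>
</grammar>
```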

The <notAllowed/> element is a placeholder pattern that can never match
anything. This would be a problem if the include file were to be used by itself, but
our XMLNews-Story TREX pattern will replace the no-op pattern with this pattern:
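A sketch of the overriding side follows. The file name blocks.trex and the definition name extra.block are hypothetical, and the content of <location> and <copyrite> is assumed to be plain string data.

```xml
<grammar xmlns="http://www.thaiopensource.com/trex">
  <!-- Hypothetical file name for the shared block definitions. -->
  <include href="blocks.trex"/>

  <!-- The start element and the other news story definitions go here. -->

  <!-- Replace the placeholder with the news story block elements. -->
  <define name="extra.block" combine="replace">
    <choice>
      <element name="location"><anyString/></element>
      <element name="copyrite"><anyString/></element>
    </choice>
  </define>
</grammar>
```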

This include-and-override capability lets you develop a set of core patterns that
can easily be modified for validating a wide variety of markup languages. Other options
for the combine attribute are choice and group, which let
you add to a definition without entirely replacing it.
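As an illustration of adding rather than replacing, an XHTML pattern might extend a definition with combine="choice". This is only a sketch: the file name blocks.trex and the definition names extra.block and block_item are hypothetical, and the content model of <blockquote> is an assumption.

```xml
<grammar xmlns="http://www.thaiopensource.com/trex">
  <!-- Hypothetical file name for the shared block definitions. -->
  <include href="blocks.trex"/>

  <!-- Add blockquote as one more alternative to whatever extra.block already allows. -->
  <define name="extra.block" combine="choice">
    <element name="blockquote">
      <!-- Assumption: a blockquote holds further block items. -->
      <ref name="block_item"/>
    </element>
  </define>
</grammar>
```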

Another advanced feature of TREX is the <concur> element, which lets you
verify that your XML satisfies all of a number of patterns.

Summary

TREX is a powerful markup language that permits you to specify how other XML documents
are to be validated. As with other specification languages, you can