How to write a DTD schema

The following guide illustrates how to write a DTD (Document Type Definition) schema file for xmlfy.

Planning the DTD schema

It is good practice to write an accurate and well-formed DTD schema because it may be
useful to programs other than xmlfy in the future.

You can also build a library of schemata for the variety of data that you may want to
xmlfy and store them in a common directory.

Always start with a simple DTD schema and gradually build up its complexity, because shoe-horning
foreign raw data into XML format using only a DTD schema file can be quite a feat, particularly
if you want to use complex and nested element structures.

Analysing the input data

The key to writing a good schema file is to understand the data that it is trying
to describe.

For example, let's look at the output of the ls -la command from a Cygwin shell.

In this example five lines of data were returned, presenting two different structures of data
(line 1 = summary total, lines 2-5 = file details).

Let's call the "summary total" structure the total record and the "file details" structure
the file records.

Defining the root element

Under certain circumstances the total record may not appear, e.g. ls -la just-one-file. This means
the total record has a none or one relationship with the output of the ls -la command.

The file record may appear many times or none at all depending on the number of files
returned by the ls command. This means the file record has a none or many relationship
with the output of the ls -la command.

We can now write the first line of our DTD schema file to look like this:

<!ELEMENT ls (total?, file*)>

This is saying that the root element which is called ls is comprised of two
elements with the first element total occurring none or one times (represented
with the ? token), and the second element file occurring none or many times
(represented with the * token).
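
As a rough illustration of how such a declaration can be read, the following Python sketch splits a simple declaration into its element name and child fields, together with the occurrence range each token implies. It assumes the root declaration is written as one comma-separated group, <!ELEMENT ls (total?, file*)>. The names parse_element, ELEMENT_RE and OCCURRENCE are invented for this sketch and are not part of xmlfy.

```python
import re

# Hypothetical helper for illustration; this is not xmlfy's own parser.
# It only handles a single flat group with no nested parentheses.
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^()]*)\)\s*>")

# Occurrence tokens and the (minimum, maximum) counts they stand for.
# None means "no upper limit".
OCCURRENCE = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}


def parse_element(declaration):
    """Split one <!ELEMENT> declaration into its name and child fields."""
    match = ELEMENT_RE.fullmatch(declaration.strip())
    if match is None:
        raise ValueError(f"not a simple <!ELEMENT> declaration: {declaration!r}")
    name, body = match.groups()
    fields = []
    for item in body.split(","):
        item = item.strip()
        # A trailing ?, * or + is the occurrence token; no token means once.
        token = item[-1] if item and item[-1] in "?*+" else ""
        child = item[: len(item) - len(token)]
        low, high = OCCURRENCE[token]
        fields.append((child, low, high))
    return name, fields


name, fields = parse_element("<!ELEMENT ls (total?, file*)>")
print(name)    # ls
print(fields)  # [('total', 0, 1), ('file', 0, None)]
```

Each child is reported as (name, minimum, maximum), so total? becomes none or one and file* becomes none or many.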

Defining the record elements

The total record always has two fields, prompt and totalsize, and both are
always present in this record.

We can now write the second line of our DTD schema file to look like this:

<!ELEMENT total (prompt, totalsize)>

This is saying that the record element which is called total is comprised of two
elements with the first element prompt occurring only once, and the second element
totalsize also occurring only once.

The file record can have a variable number of fields, up to a maximum of nine, with
one of those fields, fname, being mandatory.

We can now write the third line of our DTD schema file to look like this:

This is saying that the record element which is called file is comprised of eight optional
elements occurring none or one times (represented with the ? token), and with the last element
fname occurring only once.

The date_ty record can be represented either as a year or as hours:minutes, so it is given
two definitions with the same name, one for each form.

We can now write the next two lines of our DTD schema file to look like this:

<!ELEMENT date_ty (date_y)>
<!ELEMENT date_ty (date_h, date_m)>
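
One way to picture how two definitions with the same name could be told apart is by the number of input fields on offer. The Python sketch below is only an illustration under that assumption: it supposes the hours and minutes arrive as two separate input fields, and the function match_definition and the list layout are invented here, not taken from xmlfy.

```python
# Two definitions sharing the name date_ty, as in the guide.
# The list-of-lists layout is an assumption made for this sketch.
DATE_TY_DEFINITIONS = [
    ["date_y"],            # <!ELEMENT date_ty (date_y)>
    ["date_h", "date_m"],  # <!ELEMENT date_ty (date_h, date_m)>
]


def match_definition(definitions, input_fields):
    """Return the first definition whose field count equals the input's."""
    for definition in definitions:
        if len(definition) == len(input_fields):
            # Pair each schema field with the input field in the same position.
            return dict(zip(definition, input_fields))
    return None  # no definition fits this number of input fields


print(match_definition(DATE_TY_DEFINITIONS, ["2009"]))
# {'date_y': '2009'}
print(match_definition(DATE_TY_DEFINITIONS, ["14", "05"]))
# {'date_h': '14', 'date_m': '05'}
```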

Defining the field elements

Strictly speaking, xmlfy does not require any further definitions to work because it ignores elements
in the DTD schema file that have the strings (#CDATA) or (#PCDATA) in them. However, it is good
practice to furnish a complete DTD schema, so we include the field element definitions.

We can now write the final lines of our DTD schema file to look like this:

A word on capturing data when using a schema file

Shoe-horning raw data into a structure defined by a schema is
rather straightforward when the input fields have a one-to-one
relationship with the fields of the schema elements. However, if
wildcard tokens and/or Boolean logic are employed in the schema,
it becomes quite a challenge, sometimes even impossible,
to be deterministic about which input field belongs to which
schema field. Strictly speaking, the main function of the schema is
to ensure XML is valid, which requires the XML document to
already exist. In xmlfy's case we are doing the reverse by
building an XML document on the fly while following rules
described by the schema - this is still okay and the resulting
XML can be considered to be both valid and well-formed.

xmlfy employs two techniques to help with the problem of
shoe-horning input data. The first technique is recognising
multiple element definitions that have the same name. This allows
you to capture your schema elements under a variety of input
circumstances without having to create a unique element for each
circumstance, although you can still do that if you want. The second
technique is auto-generated field match constraint helpers, which
assist in matching the input fields to the elements described by
the schema. These helpers are useful in improving the speed of
xmlfy, particularly when using compound element structures and
wildcard tokens in the schema hierarchy. After the schema file is
loaded into memory, an array of helpers is generated for each
element. The array describes all combinations of the schema tree
traversal paths that can be taken and associates each combination
with the minimum, maximum and last number of fields required for a
match against the number of available input fields.
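
The idea of per-path field counts can be sketched in Python as follows. This is a simplified illustration, not xmlfy's actual data structure: it computes only a minimum and maximum for each traversal path and leaves out the "last" value, and the schema layout, the cut-down file definition (keeping only date_ty and fname) and the names field_ranges and can_match are all assumptions made for the example.

```python
from itertools import product

# Schema layout assumed for this sketch: element name -> list of definitions,
# each definition a list of (child, min_occurs, max_occurs); None = unlimited.
# "file" is a cut-down definition that keeps only date_ty and fname.
SCHEMA = {
    "total": [[("prompt", 1, 1), ("totalsize", 1, 1)]],
    "file": [[("date_ty", 0, 1), ("fname", 1, 1)]],
    "date_ty": [
        [("date_y", 1, 1)],
        [("date_h", 1, 1), ("date_m", 1, 1)],
    ],
}


def field_ranges(element, schema):
    """List one (min_fields, max_fields) pair per traversal path of element."""
    if element not in schema:
        return [(1, 1)]  # undefined elements count as a single data field
    ranges = []
    for definition in schema[element]:
        # Every child may itself offer several paths; combine them all.
        child_options = [field_ranges(child, schema) for child, _, _ in definition]
        for combo in product(*child_options):
            low, high = 0, 0
            for (_, min_occurs, max_occurs), (c_low, c_high) in zip(definition, combo):
                low += min_occurs * c_low
                if high is not None:
                    if max_occurs is None or c_high is None:
                        high = None  # an unlimited child makes the path unlimited
                    else:
                        high += max_occurs * c_high
            ranges.append((low, high))
    return ranges


def can_match(element, field_count, schema):
    """True if at least one traversal path accepts this many input fields."""
    return any(
        low <= field_count and (high is None or field_count <= high)
        for low, high in field_ranges(element, schema)
    )


print(field_ranges("total", SCHEMA))    # [(2, 2)]
print(field_ranges("date_ty", SCHEMA))  # [(1, 1), (2, 2)]
print(field_ranges("file", SCHEMA))     # [(1, 2), (1, 3)]
print(can_match("file", 4, SCHEMA))     # False
```

In this sketch an element can be rejected from its field counts alone, before any field-by-field matching is attempted, which is the kind of saving the helpers are described as providing.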

By default xmlfy continuously iterates through just the record
elements of the root element, looking for element helpers that can
fully satisfy the requirements of that particular element's schema
tree hierarchy for the given input fields. The matching record
element is then checked against its wildcard obligations in the
root element definition and, if okay, is finally printed.
In match direct mode xmlfy only looks at the element helpers of
the targeted element. If that element can fully satisfy the
requirements of its schema tree hierarchy for the given input fields,
it is printed in its entirety only once as the root element.

Important note

Currently the xmlfy DTD schema file parser is not that sophisticated and exhibits the
following limitations:

Only recognises the <!ELEMENT> directive and ignores all others.

The first valid <!ELEMENT> definition becomes the root element.

The fields of the root element define all the level 1 elements (let's call
the fields that have their own branch structure record elements).

The fields of the record elements simply represent other elements and unlimited
element nesting is allowed.

By default, fields of the root element that are not record elements are ignored.
Use the match direct option to match targeted elements in their entirety.

Element fields that don't have an element definition default to being (#PCDATA).

Elements defined inside the DTD schema file as (#PCDATA) or (#CDATA) are ignored,
causing the referring field to default to (#PCDATA). However, it is good practice to include
these elements in order to furnish a complete DTD schema.

Only honours the +, ? and * wildcard tokens.

At this stage does not honour field group sets () and or-ing | syntax tokens.

The field names that are specified in the element definitions are read
from left to right and matched against a field number calculation on the
input fields, and then matched again on any wildcard tokens.

You can wildcard many fields but you should think clearly about what you are
trying to achieve and whether it is at all possible.

For example, the following DTD schema, which is perfectly suitable for checking
for valid XML, will prove impossible for xmlfy to use to shoe-horn
input data into schema elements a, b and c reliably, because
more than one field has a wildcard token to match none or many input fields.

In the above example xmlfy will allocate ALL input fields to element <a>
and that MAY not be the desired intention.
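
The effect can be pictured with a small Python sketch. It assumes a simple left-to-right allocation in which each schema field takes as many input fields as its wildcard allows, which is consistent with the outcome described above but is not taken from xmlfy's source. The function allocate_greedy and the (name, maximum) layout are invented for the illustration.

```python
def allocate_greedy(fields_spec, input_fields):
    """Assign input fields left to right, each schema field taking all it may."""
    remaining = list(input_fields)
    allocation = {}
    for name, max_occurs in fields_spec:
        # None means no upper limit, so the field swallows everything left.
        if max_occurs is None:
            take = len(remaining)
        else:
            take = min(max_occurs, len(remaining))
        allocation[name] = remaining[:take]
        remaining = remaining[take:]
    return allocation


# Three fields that may each occur none or many times, i.e. a*, b* and c*.
spec = [("a", None), ("b", None), ("c", None)]
print(allocate_greedy(spec, ["x", "y", "z"]))
# {'a': ['x', 'y', 'z'], 'b': [], 'c': []}
```

Because a has no upper limit, nothing is left for b or c. Giving the earlier fields a fixed count is what lets the later ones receive data.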

Don't worry if you find some of the above hard to digest; as you get more familiar with writing
schemata this will become clearer.

Conclusion

That concludes the DTD schema writing process. xmlfy provides a significant number of
command line options to change the behaviour of its processing of the input and
output stream over and above the DTD schema file supplied. You are encouraged to
experiment a little with xmlfy to get comfortable with these features.