Resource Directory (RDDL) for
Hook 0.2

A One-Element Language for Validation of XML Documents
based on Partial Order

This document is a RDDL
Resource Directory Description for the Hook 0.2 validation language:
an XHTML document with special XLinks that
locate various resources useful for Hook.

The Hook validation language is a thought experiment in minimalism in XML schema languages. The purpose of such a minimal language would be to provide useful but ultra-terse success/fail validation for basic incoming QA, especially of datagrams. It is like a checksum for
a schema.

The validation it performs can be characterized as
"Does this element have a feasible name, ancestry, previous-siblings
and contents?", there being some tradeoff between how
fully the later criteria are tested.

Let us start with the following technical criteria:

Smaller than DTD: if it is downloaded from a server as a separate file,
it should be downloadable in the first packet group,
so less than 512 (the minimum MTU) - 100 (for the MIME header) = 412 bytes.

Implementable by a streaming processor

No forward references

No pathological schemas, as far as blowouts are concerned

An efficient implementation should be possible

Suitable for coarse validation of document for some significant issues

The schema should be namespace-aware

The minimal schema should only require 1 element or perhaps fit in a PI

The datatype should be expressible using XML Schemas regular expressions
or simple space-separated tokens.

The schema paradigm is the (partial) ordering of elements against the information kept during stream processing
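
As a worked version of the size criterion above, the following Python sketch computes the 412-byte budget and checks a schema file against it. The constant names, the function name, and the sample order text are illustrative assumptions, not part of Hook; the sketch also assumes the file is served as UTF-8.

```python
# Figures taken from the size criterion: 512-byte minimum MTU, 100 bytes for the MIME header.
MIN_MTU = 512
MIME_HEADER_ALLOWANCE = 100
BUDGET = MIN_MTU - MIME_HEADER_ALLOWANCE  # 412 bytes


def fits_first_packet(schema_text: str) -> bool:
    """True if the schema, encoded as UTF-8 (an assumption), is under the budget."""
    return len(schema_text.encode("utf-8")) < BUDGET


# Hypothetical order text, used only to exercise the check.
sample = "html head [ title meta. ] body"
print(BUDGET, fits_first_packet(sample))
```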

The Language

A Hook schema is an element containing a list of element names,
some of which may be grouped by
square brackets. This list represents a certain ordering of the
names and validation consists of checking conformity to this ordering.

The friendly attribute specifies whether elements from other namespaces are allowed.

The short attribute specifies whether all the elements in the namespace have been mentioned; if not, unmentioned elements are allowed as if specified in a group at the end of the schema.

The top attribute specifies whether the first element in the schema must be the document element (or the local root of a branch starting this namespace).

The order element has the following grammar, where
s is one or more whitespace characters (or string-start or string-end)
and NCName is an XML name with no colons.

s ( (NCName "."? s )|
( "[" s (NCName ("."|";")? s)+ "]" s )
)+

The order element specifies an ordering of elements; elements grouped by square brackets are at the same level of the order.
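
The grammar above is small enough to parse by splitting on whitespace. The following Python sketch is one possible reading, not a reference implementation. It makes two assumptions: names are restricted to ASCII, and a trailing "." or ";" is always taken as a mark (an XML name may itself end in "."). The function name parse_order and the output shape, one (bracketed, {name: mark}) pair per list position, are illustrative choices. It also rejects ";" outside a group, as the grammar does.

```python
import re

# Assumptions: ASCII-only names; a trailing "." or ";" is always read as a mark.
TOKEN = re.compile(r"([A-Za-z_][A-Za-z0-9_.\-]*?)([.;]?)")


def parse_order(text):
    """Parse Hook order text into a list of (bracketed, {name: mark}) levels."""
    levels = []
    group = None  # dict being filled while inside "[ ... ]", otherwise None
    for tok in text.split():
        if tok == "[":
            if group is not None:
                raise ValueError("groups cannot nest")
            group = {}
        elif tok == "]":
            if not group:
                raise ValueError("']' without a non-empty open group")
            levels.append((True, group))
            group = None
        else:
            m = TOKEN.fullmatch(tok)
            if m is None:
                raise ValueError(f"bad token: {tok!r}")
            name, mark = m.groups()
            if group is None:
                if mark == ";":
                    raise ValueError("';' is only allowed inside a group")
                levels.append((False, {name: mark}))
            else:
                group[name] = mark
    if group is not None:
        raise ValueError("unclosed '['")
    if not levels:
        raise ValueError("an order needs at least one name")
    return levels


print(parse_order("a [ x y; ] z."))
```

Because the grammar requires whitespace around the brackets, a token such as "[x" is rejected and is not split.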

Validation proceeds through the document in document (streaming) order. For each element, it checks that every previous-sibling element at the same level, and then each ancestor element, is ordered according to the list order. Intermediate list items may be skipped, but validation fails if any element has no corresponding item in the schema. A name may appear more than once.
(Actually, an implementation only needs to look at the first child and/or
next sibling to perform validation, but explaining it this way around
may make the syntax easier to understand.)

A fullstop (period) on an element indicates that
the element may have no contents (no subelements, and the space-normalized value
of the contents is empty): this is almost the same as EMPTY. A semi-colon indicates that the current group will be broken out of: the
named element cannot contain elements in the same
group. (It can still be followed by elements of the same group.)
A semi-colon in a group at the end of a schema thus indicates that simple content only is possible.

Normally [ x y ] allows

<x><y/><x/><y/></x>
<y><x/><y/><x/></y>

but [ x y; ] allows

<x><y/><x/><y/></x>

but not

<y><x/><y/><x/></y>

while [ x y. ] allows

<x><y/><x/><y/></x>

but not

<y><x/></y>

So [ x y ] means

an x can contain any number of nested x and y before any other
element

an x can be followed by any number of x and y before any other
element

a y can contain any number of nested x and y before any other
element

a y can be followed by any number of x and y before any other
element

but [ x y; ] adds the constraint

a y cannot contain a y next (unless the next particle in the hook
schema happens to be a y, e.g. [ x y; ] y )

a y cannot contain an x next (unless the next particle in the hook
schema happens to be an x, e.g. [ x y; ] x )

So ";" is used to break out of the recursion allowed in a [ ] group.
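
The rules and bracket examples above can be exercised with a small checker. The Python sketch below is one interpretation, not the Hook algorithm. It takes the schema already parsed, as one (bracketed, {name: mark}) pair per list position, which is an illustrative shape. It reads the whole document with ElementTree, so it is not streaming, and it ignores namespaces and the friendly, short and top attributes. Two rules are assumptions that the text does not pin down: a following sibling may stay at the same list position as its predecessor, and a child may share its parent's position only inside a [ ] group whose parent name carries no ";".

```python
import xml.etree.ElementTree as ET


def step(levels, prev_positions, prev_name, name, descending):
    """Positions `name` may take after `prev_name`, never moving backwards."""
    found = set()
    for p in prev_positions:
        bracketed, items = levels[p]
        # Assumed rule: a child may share its parent's position only inside
        # a [ ] group, and not when the parent carries a ';' mark.
        strict = descending and (not bracketed or items[prev_name] == ";")
        first = p + 1 if strict else p
        found |= {i for i in range(first, len(levels)) if name in levels[i][1]}
    return found


def check(elem, positions, levels):
    """Return the positions `elem` can occupy, or an empty set on failure."""
    has_content = len(elem) > 0 or bool((elem.text or "").strip())
    if has_content:
        # An element with content cannot match an item marked '.'.
        positions = {i for i in positions if levels[i][1][elem.tag] != "."}
    prev_positions, prev_name, descending = positions, elem.tag, True
    for child in elem:
        child_positions = step(levels, prev_positions, prev_name, child.tag, descending)
        child_positions = check(child, child_positions, levels)
        if not child_positions:
            return set()
        # Later children are compared with their previous sibling.
        prev_positions, prev_name, descending = child_positions, child.tag, False
    return positions


def validate(xml_text, levels):
    """True if every element of the document fits the order in `levels`."""
    root = ET.fromstring(xml_text)
    start = {i for i, (_, items) in enumerate(levels) if root.tag in items}
    return bool(check(root, start, levels))


semi = [(True, {"x": "", "y": ";"})]  # the schema [ x y; ]
print(validate("<x><y/><x/><y/></x>", semi))
print(validate("<y><x/><y/><x/></y>", semi))
```

Under these assumptions the checker accepts the first document and rejects the second, matching the [ x y; ] example. Each element carries a set of candidate positions, not a single one, because a name may appear more than once in the list.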

Intuitively, this is like first making a big list of every element allowed, putting them all in a choice group. This gives us a complete definition of every allowed element:
it defines the namespace and catches spelling errors. Next, if there are some elements that can start, move them out to the front (or copy them if they can reappear).
Now the schema validates the top-level elements too. Next, if there are some elements that can only appear as the last elements in a content model (e.g. the z in (x, y, z) or the b and c in (a, (b | c)*)), then move these out to a group at the end.
Now we have validation for elements in simple mixed content. Continue factoring until done.

It is quite possible that there are languages which exhibit orders
that cannot be usefully captured. In those cases, a hook schema still
can show the top element, all names in the namespace, and which elements
must be empty.

This schema captures a lot of containment relationships OK,
I think: probably it has some mistake.
But it will not detect what may be a common XHTML problem, where
omit-end-tag HTML elements like <body> are converted to <body />.
However it will detect problems like <meta> not being converted
to an empty tag and so spuriously including other head elements.

This is a much more successful example! Note, every valid PO document
will also be valid against this schema and that the schema validates all sequence
requirements. What it won't catch is if an end-tag is in the wrong place w.r.t. what should be
a sibling. So it seems that Hook may be good for validating datagrams of this kind.

Again, this is pretty good: there is a good amount of order to capture. The
"diagnostics diagnostic" could also come before or after rule.

In all four cases above the character count is less than 400 characters, so it looks
like they would be retrieved in the first packet group from a server.

Comments

Hook seems to suit languages that have large flat bottoms, languages with specific requirements early on in each content model, languages with specific elements that do not re-occur in different contexts with different priorities, and languages with attributes that are not vital or will be checked by other mechanisms.

Hook would seem useful as a coarse-grained but ultra-terse validation language.

If we say that validation is to catch errors that are most likely to happen, the most
likely errors are spelling errors, children in the wrong order, and required parents: Hook
gets or catches most.

How much would this help an interactive editor? It would know which elements can start,
but for new documents it would present too many choices. However, if editing existing documents it would cull the available list pretty well, because it would know what the current level was.
It would know empty elements.

It would be nice to signal order by < but too much markup would be required.

Formalization

Joe English has posted interesting material regarding formalisms
for Hook, an algorithm for implementing it, and other material.
See the XMLHACK.COM item.

Why Hook?

The name Hook comes from the supposed hook shape made when drawing this on a parse tree: tracing back along the previous siblings, then up the ancestors.

Related Resources for Hook 0.2

Well known URI

The well known URI for Hook is http://www.ascc.net/xml/hook.

Root namespace URI

http://www.ascc.net/xml/hook is the namespace of the root
element of a hook program.

Copyright 2001 (C) Rick Jelliffe

There is no Hook 0.2 software from me, however if you make some, please
consider making that software available under
the conditions of the zlib/libpng license (the least restrictive).
Comments, fixes and upgrades welcome: email ricko@gate.sinica.edu.tw