Proceedings of Extreme Markup Languages®

Declarative specification of XML document fixup

Henry Swift Thompson

Abstract

Fixing broken XML is a problem interesting in its own right, and fixing
broken (X)HTML is a particularly timely topic. This paper
introduces a declarative approach to specifying the repair of ill-formed XML,
based on the TagSoup work of John Cowan. A prototype implementation, called
PYXup, is described, which operates on the PYX output of
the scanner phase of TagSoup to fix-up all well-formedness and some structural problems.

Table of Contents

Henry Swift Thompson

Henry S. Thompson divides his time between the School of Informatics
at the University of Edinburgh, where he is Reader in Artificial
Intelligence and Cognitive Science, based in the Language Technology
Group of the Human Communication Research Centre, and the World Wide
Web Consortium (W3C), where he works in the XML Activity.

He received his Ph.D. in Linguistics from the University of California
at Berkeley in 1980. His university education was divided between
Linguistics and Computer Science, in which he holds an M.Sc. While
still at Berkeley he was affiliated with the Natural Language Research
Group at the Xerox Palo Alto Research Center, where he participated in
the GUS and KRL projects. His research interests have ranged widely,
including natural language parsing, speech recognition, machine
translation evaluation, modelling human lexical access mechanisms, the
fine structure of human-human dialogue, language resource creation and
architectures for linguistic annotation. His current research is
focussed on the semantics of markup, XML pipelines and more generally
articulating and extending the architectures of XML.

He was a member of the SGML Working Group of the World Wide Web
Consortium which designed XML, a major contributor to the core
concepts of XSLT and W3C XML Schema and is currently a member of the
XML Core, XML Schema and XML Processing Model Working Groups of the
W3C. He has been elected twice to the W3C TAG (Technical Architecture
Group). He was lead editor of the Structures part of the XML Schema
W3C Recommendation, for which he co-wrote the first publicly available
implementation, XSV. He has presented many papers and tutorials on
SGML, DSSSL, XML, XSLT, XML Schema, XML Pipelines and Web Architecture
in both industrial and public settings over the last ten years.

Declarative specification of XML document fixup

Extreme Markup Languages 2007® (Montréal, Québec)

Introduction

The historical and social dimensions of the current
situation and
future trajectory of the HTML family of languages are so complex as to defy
easy analysis (see for example [TAG07]), but one issue at least stands out: should 'the next HTML' be
an XML language, and should it have an explicit schema? One major
constituency has specifically rejected the use of any
form of schema, and is describing their candidate for 'the next HTML' only
using (somewhat stylised) natural language [Hickson07].

The
primary motivation for the 'no schema' approach is the perceived necessity of
specifying the language in a way which accommodates the brutal reality of
current practice: most HTML on the Web today is invalid, but browsers process
it pretty consistently none-the-less. The differences between HTML and XHTML are
at best poorly understood, and rarely respected. The complex interplay of the relevant
technologies and human propensities are judged likely to mean that this
situation will persist indefinitely. In [TAG07] the W3C's
Technical Architecture Group asks

"Is the indefinite persistence of 'tag soup' HTML consistent with a
sound architecture for the Web? If so, (and the going-in assumption
is that it is so), what changes, if any, to fundamental Web
technologies are necessary to integrate 'tag soup' with SGML-valid
HTML and well-formed XML?

. . .

"Can we [treat tag soup] 'as if' it was the serialization of the DOM
produced by (some formalization of) common browser error recovery strategies?"

This paper explores one possible route to implementing the 'as if'
suggestion, particularly the "some formalization of" parenthetical. There is
one outstanding precedent for this, namely John Cowan's TagSoup[Cowan04].
Although to a certain extent TagSoup's
behaviour is driven by declarative specifications, there are significant
aspects which cannot be affected easily, or at all, without modifying the code.
Preliminary exploration of matching the behaviour of TagSoup to the error-correcting behaviour of current browsers ran into difficulties for this reason. The PYXup system described here is an attempt to reconstruct part of TagSoup in a way which exposes more behavioural options to external declarative control.

Architecture and Methodology

PYXup is closely based on [Cowan04],
which has two distinct components, both implemented in Java:

A table-driven scanner;

A schema-driven 'rectifier'.

The first component, the scanner, which parses input into a stream of
low-level tokens, is already pretty much completely
customisable via the external table which controls it. This table is expressed
in XML.

The second component, the 'rectifier', which deals with, for example,
missing end tags, is controlled in part by an external schema, again expressed
in XML, but the degree of customisation this provides for is fairly limited.

Accordingly, PYXup uses the scanner component of [Cowan04] unchanged, but implements its own 'rectifier', operating on the PYX [McGrath00] output of the scanner to produce a fixed-up stream of PYX events.

The fixup component of PYXup is driven by two
declarative inputs:

A simplified grammar;

A set of fixup rules.

The simplified grammar is expressed in XML, and can be derived
automatically from a W3C XML Schema schema (and thus from other
grammar-orientated schema languages such as DTD and Relax NG). It consists entirely of immediate
dominance rules, and is interpreted as specifying a very regular push-down automaton. The fixup rules are likewise
expressed in XML, and are interpreted as a kind of production system, with
patterns expressed in terms of problem type and parsing state and actions
expressed in terms of output events and operations on parser state.

Schema details

[Cowan04] defines a simple schema language,
TSSL [Tag Soup Schema Language], which represents primarily immediate dominance information,
with no sequencing, disjunction or cardinality information. PYXup uses a
similar approach, with even less detail. A PYXup schema is a set of rules,
each of which specifies only

nonTerm

An element name

children

A set of element names

mixed

A boolean

maybeRoot

A boolean

preferredParent

(optional) An element name

A PYXup schema also may specify a preferred parent for otherwise
not-allowed text.

A PYXup schema is interpreted as a push-down automaton, in which each
rule is corresponds to a finite state automaton, whose name is given by its
nonTerm, as follows (for an element named nt):

The )) transition is to accommodate the
PYX event generated for empty tags and the anachronistic
</> markup.

Furthermore, for every child a transition from and to state 2 is added,
labelled with the name of the corresponding FSA. If the element allows text
content, a transation labelled -* is added likewise. For example, for the
b element as defined above, that is <!ELEMENT b
(#PCDATA|c)*>, the result would be

Such an automaton will accept the beginning and end
PYX events for a b element, with any number of balanced
c element begin/end pairs, mixed with any number of text events,
in between.

Finally, the collection of automata is transformed to simplify
processing and improve
efficiency, by as it were promoting the opening transition from each embedded
automaton, so that for example the automaton for the a element as
defined above, that is <!ELEMENT a (b,d)> as actually used
looks like this:

This change implies that there is must also be a root automaton which
dispatches on the open event for all possible document elements to the
appropriate named automaton.

The automata act as recognisers and identity transducers, that is, as
each event is accepted it is echoed.

Problem taxonomy and fixup details

What can go wrong, and how can it be fixed? At one level, there are
only four things that can go wrong:

There is no input, but we're not done;

There is more input, but we are done;

We accept an open event, but there is no corresponding automaton;

The current event cannot be accepted in the current state.

Case (2) is currently not detected by PYXup. This means that
strictly speaking PYXup's output is a well-formed XML
entity, that is, it may contain one or more elements.
This case is discussed further in “Further work”.

Case (4) covers a wide range of well-formedness and basic validity
problems. PYXup currently handles the following subcases, as well as (1) and (3), based on the
type of event, the current state and the contents of the pushdown stack
(examples are based on the 'abcd' doctype above):

upEnd

The event is a close event, and there is an automaton higher up the
pushdown stack which could accept it, that is, a matching open even has
occurred, as in (a (b )a.

The default fixup is to output a close event for the current automaton
(in the example, a )b) and return from it.

badEnd

The event is a close event, and there is no automaton higher up the
pushdown stack which could accept it, as in (a )d.

The default fixup is to discard the event.

upChild

The event is an open event, and there is an automaton higher up the
pushdown stack which could accept it, that is, an open event for an element
which could accept this as a child has occurred, as in (a (b (d.

The default fixup is to output a close event for the current automaton
(in the example, a )b) and return from it.

badChild

The event is an open event, there is no automaton higher up the
pushdown stack which could accept it, but the event's tag is known to be allowed as the child of at least one other element, as in (a (c.

The default fixup is to postpone the current event and process an
open event for the preferred parent element (if specified in the grammar) or an arbitrary
allowed parent element (otherwise) instead (in the example, this would be a (b).

badOrphan

The event is an open event but no known element allows it as a
child, as in (a (b (a.

The default fixup is to force acceptance, that is, to act as if
the current automaton were constructed from an element which did have the relevant element as a child.

upText

The event is a text event, and there is an automaton higher up the
pushdown stack which could accept it, that is, an open event for an element
which allows text has occurred, as in (a (b (c -text.

The default fixup is to output a close event for the current automaton
(in the example, a )c) and return from it.

orphanText

The event is a text event, there is no automaton higher up the
pushdown stack which could accept it, but at least one element is known to
accept text, as in (a -text.

The default fixup is to postpone the current event and process an
open event for the preferred parent element for text (if specified in the grammar) or an arbitrary
text-allowing parent element (otherwise) instead (in the example, this would be a (b).

badText

The event is a text event, but there is no automaton higher up the
pushdown stack which could accept it, and no text-allowing element in the grammar at all.

The default fixup is to force acceptance, that is, to act as if
the current automaton were constructed from an element which did allow text.

overrun

The token stream ends, but the pushdown stack is not empty, as in
the case where (a (b were the entire input.

The default fixup is to output a close event for the current automaton
(in the example, a )b) and return from it.

unknown

The start tag for a tag with no definition is seen, as in (a (x.

The automatic fixup is to synthesise an automaton for the tag which
allows all known tags as children (but not text).

The abstract syntax for the fixup language represents the defaults
given above as follows:

where %o is the current automaton's tag,
%p is the tag of the preferred parent for the current event and
%t is the tag of the preferred parent for orphan text.

Comparison with TagSoup

PYXup's default fixup as described above comes very close to
reconstructing TagSoup's built-in
'rectification'. The major differences I'm aware of are as follows:

PYXup has no support for what TagSoup calls 'restartable' tags. Adding provision for recording restartability information to the schema language and adding a
restart action to the fixup language should be possible.

Similarly, PYXup has no support for what TagSoup calls 'uncloseable' tags, that is, form and table. Exactly what the right generalisation is for this class is unclear to me.

PYXup has no support for what TagSoup identifies as 'cdata' tags, that is, script and style.

PYXup currently treats unknown elements as if they allowed any
known content, whereas TagSoup provides
command-line control of whether such elements are treated that way, or as if
they allowed no content. It would be straightforward to add this to PYXup.

TagSoup does some namespace fixup, but its results are not always
ideal. PYXup currently doesn't do any kind
of namespace fixup, so its output is not
guaranteed to be namespace-well-formed.

PYXup currently does not detect or eliminate multiple document elements, as it
were, whereas TagSoup guarantees to produce
a single document element, by treating the initial start tag specially: all
end tags which match the initial start tag
are ignored, thereby capturing all the input as children of the original start
tag, and requiring a final end tag for it to be synthesised at the end of the
token stream. It would be straightforward to add this to PYXup as a command-line option.

At the moment, when a given problem (e.g. non-allowed start tag) may
need differing fixups depending on context, PYXup handles this by dispatching
on a finite number of different keys which encode the context differences (e.g.
upChild, badChild, badOrphan). It might be possible to address the 'restartable', 'unclosable' and 'cdata' issues described in the list above by further
articulation of such keys, but it probably makes more sense to revise the rule language to allow for richer patterns which
could test for (boolean combinations of) a range of properties of the current event and the events on the pushdown stack.

Further work

From a
much more theoretical perspective, there is a body of work in formal language
theory about error correction, see e.g. [Fischer77], which deserves
to be explored for possible relevance.

For the time being PYXup discards comment, processing instruction and
attribute events -- obviously this needs to be fixed! A systematic approach to
namespaces is also needed.

The most important next step is to look at the kinds of fixup described
in the prose of [Hickson07] to see which ones are covered by the
default rules given above,
which ones could be covered with new rules in the existing language or the
extensions proposed above in section “Comparison with TagSoup”, which
ones would require other changes to the language and, of course, which ones cannot
be accommodated by the PYXup paradigm at all.

Concrete syntax

The XML languages for specifying the
schema and the fixup rules are both very simple. Only two elements are
involved for the schema:

Availability

Acknowledgments

Needless-to-say, I owe a substantial debt of gratitude to John Cowan,
not only
for the design of TagSoup, but for making
its implementation available. He is not, of course, responsible for any of the
changes and additions reported here.