Marpa resources

Wed, 07 Dec 2011

This is the second of a series of posts that details
a Marpa-based, "Ruby Slippers"
approach to parsing liberal
and defective HTML.
This post assumes you have
read
the first post.

First, reduce the HTML to a token stream

Most computer languages can be viewed
as a token stream.
HTML is not an exception.
HTML tokens can be blocks of text;
comments and various other SGML entities;
HTML element start tags;
and HTML element end tags.
The HTML token stream is unusual in that
some of its tokens can
be quite complex internally.

In parsing computer languages,
it is a frequent practice to divide the
work between a tokenizer ("lexer")
and a high-level parser.
The lexer takes the raw input
and turns it into a token stream.
Tokenizing HTML is a difficult job,
and one for which there is an excellent CPAN module:
HTML::Parser.
Marpa::HTML
relies on
HTML::Parser
to do its tokenization.

Marpa::HTML
determines the large scale structure of the HTML document --
what I will call in this post, "high-level" parsing.
The result of high-level parsing can be
seen as a hierarchical structure.
The goal of high-level parsing is
to build a hierarchy which reflects
the structure of the document.
Conventionally, this hierarchy is visualized
as an upside-down tree,
one where the "leaves" are the tokens,
and where the "root" node represents the
document as a whole.

An HTML document contains components
of many kinds,
but the hard part of determining its structure
is deciding where to begin and end its HTML elements.
Know this and you know how to arrange the HTML elements in
a hierarchy.
Know how to arrange the HTML elements in a hierarchy,
and you have produced essentially the entire parse
tree for the HTML document.

When it comes to knowing where to start
and end HTML elements
in liberal HTML,
there are complications aplenty.
Conceptually, every HTML element has a start
and end tag but,
even in fully standard-conforming input,
not all of these tags need to be in the physical input.
In real-life HTML, start and end tags are often
missing.
Additionally, a liberal HTML parser
is expected to be robust in
the face of spurious tags.

Second, do a census of the HTML elements

In HTML according to the standards,
there is a fixed list of HTML elements.
In theory, a liberal HTML parser could work
from a list which is a superset of all the standards,
plus a list of the known browser-specific tags.

Marpa::HTML
takes another approach.
It makes a first pass over its input to "census"
the elements in the HTML document,
and builds its grammar based on that census.
This has several advantages:

It is an easy way to deal with spurious tags.

Vendor-specific tags can handled using the same mechanisms
that deal with spurious tags.
This is reasonable since the boundary between vendor-specific
and spurious tags is not sharp.

The grammar avoids having lots of rules for
elements which do not actually exist in the input.

Third, create a grammar

Marpa::HTML
creates
a new grammar for every parse.
Much of the grammar is constant -- there is a "framework".
This framework already includes some of the HTML elements.
Into this framework,
Marpa::HTML
inserts rules for
the additional elements discovered during the
census.
Each element has only one rule,
which takes the form:

ELEMENT ::= START_TAG CONTENTS END_TAG

The grammar assumes that every HTML element
has both a start tag and an end tag.
From the grammar's point of view, these are never
optional.
Marpa::HTML's
grammar is therefore,
not just strict,
but over-strict.
If there were no Ruby Slippers,
the over-strict grammar
would fail to parse many
standard-conforming HTML documents,
and would be hopeless at dealing with liberal HTML.

Fourth, determine which tokens are allowed to be virtual

A necessary complement to
Marpa::HTML's
over-strict grammar
is a list of the possible "virtual"
tokens.
"Virtual" here means that the token is not present in the
physical input.
Virtual tokens are tokens that the Ruby Slippers logic
is allowed to invent
in order to make the over-strict grammar's wishes
come true.

We don't want all tokens to be allowed as virtual tokens.
For example, we don't want the Ruby Slippers logic
inventing text.
What we want the Ruby Slippers to do is to supply
just enough structure to make the HTML document
conform to the over-strict grammar.

The list of virtual tokens in
Marpa::HTML
consists entirely of element tags.
Every end tag is allowed as a virtual token.
This is necessary for liberal HTML parsing with
the over-strict grammar,
because it enables
Marpa::HTML
to supply a missing end tag for any HTML element.

For start tags,
the list of allowed virtual tokens is much shorter.
Four start tags are allowed as virtual tokens,
because the HTML standards allow them to be
missing,
and because they are needed
according to the over-strict grammar.
The four are
html,
head,
body,
and
tbody.
Marpa::HTML
goes beyond the standards in
using start tags to
repair a defective table.
The tags allowed as needed to repair a defective table are
table,
tr,
and
td.

Fifth, apply the Ruby Slippers

With all this set up,
we can do the actual parsing.
Marpa::XS
is used as the parse engine.
Almost certainly, the parse engine will soon
receive a token which is not acceptable to the over-strict grammar.
At that point the parse engine will complain
that it cannot continue.

Marpa::XS
allows the application to retry with another token.
Marpa::XS
also has
the
very special ability
to produce, at any point in the parse,
a list of those tokens that it is willing to accept.
Marpa::HTML
asks to see the list of acceptable tokens and compares it to its
list of virtual tokens.
If exactly one of the allowed virtual tokens is acceptable as input,
Marpa::HTML
invents a token of that type
and passes it on to
Marpa::XS.

When the physical input is standard-conforming HTML,
at every point where the over-strict grammar has a problem,
one and only one virtual token
will be acceptable as input.
However,
Marpa::HTML
is designed to handle
arbitrary input.
That means
Marpa::HTML
must deal with

Cruft: tokens that nothing can make right; and

Choices: situations where more than one
of the virtual tokens
is acceptable to the over-strict grammar.

Cruft

Sometimes a token appears in the physical input that nothing can
make right.
One example of cruft
is a head element start tag
inside an HTML
body element.
There cannot be more than one HTML
head element in any HTML document, so this token
has to be classified as cruft.
For these,
Marpa::HTML
has a special "cruft" pseudo-class.
The application is allowed to specify a cruft handler,
which can
throw the cruft away,
repackage it as a comment, etc., etc.
The default behavior is simply to pass cruft on, untouched.

When there is a choice of virtual token

If an HTML document is standard-conforming HTML, it
is not possible for more one virtual token to be acceptable
at any point.
In real life,
and therefore in parsing liberal HTML,
such things are not just possible,
but quite common.

Marpa::HTML
currently chooses
between virtual tokens
using rules of thumb.
These often involve tradeoffs.
For example,
when it comes to ending an element versus starting one,
the preference is, in general, to start one.
But there are exceptions to this.
One important exception is when starting an element
would lead to an infinite loop.
It would be nice if
Marpa::HTML's
behavior in dealing with choices among virtual tokens
was configurable -- as of this writing it is not.

Notes

"high-level parsing":
Actually, what I call "high-level parsing" in this
post is, in most textbooks, simply called parsing.
In this kind of context,
lexical analysis is usually considered distinct from
parsing.
Here, however, I have the problem that the lexing is
done by a module named
HTML::Parser.

Parsing is a loose term.
Lexical analysis is called parsing in some contexts,
so
HTML::Parser's
name is not actually wrong.
But it is certainly inconvenient,
when the context is a description of what
Marpa::HTML
does.
The solution adopted in
this discussion is to
make a distinction between
"low-level parsing" (lexical analysis or tokenization)
and "high-level" parsing (determining the
full structure of a document).

"XML":
Extension of
Marpa::HTML
to XML would be easy,
but I have not done it.
Marpa::HTML
is all about dealing
with the difficult and
varied HTML syntaxes,
and with liberal and defective input.
XML's syntax is well defined
and extremely easy-to-parse.
Furthermore,
a conforming XML parser is NOT allowed
to be liberal --
at the first encounter with less-than-perfect
input, a conforming XML parser
must produce an error message
and thereafter it may produce nothing
else but error messages.
When it comes to parsing perfect input
for an easy-to-parse and well-defined language,
the
Marpa::HTML
approach, frankly, offers
no advantage over the previous
state of the art.

"inventing text":
Actually, one can
imagine specific applications where filling
in missing content would be useful.
For example,
a site might have headers and/or footers
that need to be on every HTML page.
Or there could be obligatory disclaimers,
license information, etc.

"repair a defective table":
Repairing tables is an impressive trick,
but is not completely clear to me that this is the best
thing to do:
discarding stray table tags as cruft might be a better idea.
However, aggressive table repair is how
HTML::Tree
handles
things,
and in writing
Marpa::HTML
I decided to follow
its precedent.