Thu, 13 Sep 2012

This post is about
html_fmt,
a
Marpa-based
reformatter ("tidier") for liberal HTML.
html_fmt
indents HTML according to the structure of the document,
which makes the HTML a lot easier to read.
In the process
html_fmt
adds missing start and end tags and identifies "cruft".

html_fmt
is ultra-liberal about its input.
Like a browser's rendering engine,
html_fmt
never rejects a file,
no matter how defective it is as an HTML document.
An interesting experiment would be to compare what your
favorite browser does with a random text file fed to
it directly,
with what it does to the same file
after it has been passed through
html_fmt.

html_fmt
is a by-product of moving
this blog to GitHub.
In the course of bringing over
my old posts,
I wanted a filter that would tidy them up,
so I turned to an old demo script I'd written.
The old demo's usefulness was a pleasant surprise,
but it lacked two features.
First, it wouldn't read from standard input.
Second, in formatting the HTML, it introduced new whitespace.
The first problem was easy to fix.
Fixing the second involved coming up with a
"lowest common denominator" for whitespace treatment
among browsers and HTML variants.

The result,
html_fmt,
works very well as the first step in dealing with HTML
that you are rewriting by hand.
One quick pass-through and your file is much easier to read,
has all the proper tags,
and has comments pointing out any "cruft" that's there.

A production-quality "tidier" would need to be something like
GNU indent
--
bristling with options.
html_fmt
so far has only two options,
one dealing with whitespace before end tags,
the other allowing
a choice of strategies for avoiding added whitespace.
(One strategy uses comments, while the other simply leaves
the whitespace-sensitive locations as-is.)
These two options are not nearly
sufficient to deal with the full
range of whitespace issues,
never mind anything else.
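
To make the comment strategy concrete, here is a minimal sketch in Python.
It is not taken from html_fmt, and the function names
break_with_comments and strip_comments are invented for this illustration.
The idea is that a line break placed inside an HTML comment
lets the source be split across lines
without putting any whitespace text between the pieces.

```python
import re


def break_with_comments(fragments, indent="  "):
    """Join adjacent HTML fragments, one per line, without adding whitespace text.

    Hypothetical helper for illustration only. Each line break and its
    indentation are placed inside an HTML comment, so once comments are
    dropped the fragments are still directly adjacent.
    """
    separator = "<!--\n" + indent + "-->"
    return separator.join(fragments)


def strip_comments(html):
    """Remove HTML comments, as a rough stand-in for what a renderer ignores."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


if __name__ == "__main__":
    pieces = ["<b>bold</b>", "<i>italic</i>", "text"]
    formatted = break_with_comments(pieces)
    print(formatted)
    # With the comments removed, the pieces are back to touching each other.
    assert strip_comments(formatted) == "".join(pieces)
```

The other strategy, leaving whitespace-sensitive locations as-is,
amounts to not reformatting at those points at all,
which keeps the output free of extra comments
but leaves those runs of HTML on one line.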

But from a
"Worse is Better"
point of view,
html_fmt
is a good start.
It is 600 lines,
short enough to find your
way around in,
particularly once you've deleted the parts you don't like.
And its underlying Marpa-based interface is documented:
Marpa::R2::HTML.
Marpa::R2::HTML is beta, but has been stable for some time.