Wednesday, March 18, 2015

<kmc> maybe the whole project needs a better name, idk
<Ms2ger> htmlparser, perhaps
<jdm> tagsoup
<Ms2ger> UglySoup
<Ms2ger> Since BeautifulSoup is already taken
<jdm> html5ever
<Ms2ger> No
<jdm> you just hate good ideas
<pcwalton> kmc: if you don't call it html5ever that will be a massive missed opportunity

By that point we already had a few contributors. Now we have 469 commits from
18 people, which is just amazing. Thank you to everyone who helped
with the project. Over the past year we've upgraded Rust almost 50 times; I'm
extremely grateful to the community members who had a turn at this Sisyphean
task.

Several people have also contributed major enhancements. For example:

Clark Gaebel implemented zero-copy parsing. I'm in the process of reviewing
this code and will be landing pieces of it in the next few weeks.

Josh Matthews made it possible to suspend and resume parsing from the tree sink.
Servo needs this to do async resource fetching for external <script>s of the
old-school (non-async/defer) variety.

Chris Paris implemented fragment parsing and improved serialization. This means
Servo can use html5ever not only for parsing whole documents, but also for
the innerHTML/outerHTML getters and setters within the DOM.

Adam Roben brought us dramatically closer to spec conformance. Aside from foreign
(XML) content and <template>, we pass 99.6% of the html5lib tokenizer and tree
builder tests! Adam also improved the build and test infrastructure in a number
of ways.

I'd also like to thank Simon Sapin for doing the initial review of my code, and
finding a few bugs in the process.
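
As a rough illustration of the suspend-and-resume idea mentioned above, here is a minimal sketch. It is not html5ever's actual API: `SinkResult`, `Sink`, `Parser`, and `ScriptAwareSink` are hypothetical names, and splitting on whitespace stands in for real tokenization. The point is only that the sink's return value decides whether the parser keeps going, and that the parser remembers where it stopped.

```rust
/// What the sink tells the parser after each token (hypothetical type).
#[derive(Debug, PartialEq)]
enum SinkResult {
    Continue,
    Suspend,
}

/// Hypothetical stand-in for a tree sink.
trait Sink {
    fn process(&mut self, token: &str) -> SinkResult;
}

/// Toy parser: "tokens" are just whitespace-separated words.
struct Parser<'a> {
    tokens: Vec<&'a str>,
    next: usize, // index of the next token to feed to the sink
}

impl<'a> Parser<'a> {
    fn new(input: &'a str) -> Self {
        Parser { tokens: input.split_whitespace().collect(), next: 0 }
    }

    /// Feeds tokens to the sink until the input is exhausted or the sink
    /// asks to suspend. Returns true once all input has been consumed.
    /// Calling `run` again resumes right after the last token fed.
    fn run<S: Sink>(&mut self, sink: &mut S) -> bool {
        while self.next < self.tokens.len() {
            let token = self.tokens[self.next];
            self.next += 1;
            if sink.process(token) == SinkResult::Suspend {
                return false;
            }
        }
        true
    }
}

/// A sink that suspends parsing whenever a script element closes, as an
/// embedder might do while it fetches and runs an external script.
struct ScriptAwareSink {
    seen: Vec<String>,
}

impl Sink for ScriptAwareSink {
    fn process(&mut self, token: &str) -> SinkResult {
        self.seen.push(token.to_string());
        if token == "</script>" {
            SinkResult::Suspend
        } else {
            SinkResult::Continue
        }
    }
}
```

An embedder would call `run` in a loop: when it returns false, do the pending work (for example, wait for the script to load), then call `run` again on the same `Parser`.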

html5ever makes heavy use of Rust's metaprogramming features. It's been
something of a wild ride, and we've collaborated with the Rust team in a number
of ways. Felix Klock came through in a big way when a Rust upgrade broke the
entire tree builder. Lately, I've been working on improvements to Rust's macro
system ahead of the 1.0 release, based in part on my experience with html5ever.

Even with the early-adopter pains, the use of metaprogramming was absolutely
worth it. Most of the spec-conformance patches were only a few lines, because
our encoding of parser rules is so close to what's written in the spec. This
is especially valuable with a "living standard" like HTML.
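
To give a flavor of what "close to the spec" can mean, here is a small sketch using a plain `macro_rules!` macro. This is not html5ever's real macro, and the `rules!` syntax, the `State` enum, and the `step` function are invented for illustration. It covers only a heavily simplified slice of the tokenizer, but each rule line reads roughly like a sentence from the spec ("in the data state, on `<`, switch to the tag open state").

```rust
/// A tiny subset of tokenizer states (simplified for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Data,
    TagOpen,
    TagName,
}

/// Hypothetical rule-table macro: expands a list of
/// `FromState on <pattern> => ToState` lines into one `match`.
macro_rules! rules {
    ($state:expr, $ch:expr; $( $from:ident on $pat:pat => $to:ident ),+ $(,)?) => {
        match ($state, $ch) {
            $( (State::$from, $pat) => State::$to, )+
            // Anything not listed leaves the state unchanged.
            (s, _) => s,
        }
    };
}

/// Advances the state machine by one input character.
fn step(state: State, ch: char) -> State {
    rules!(state, ch;
        Data    on '<'       => TagOpen,
        TagOpen on 'a'..='z' => TagName,
        TagName on '>'       => Data,
    )
}
```

With an encoding like this, a conformance fix that adds or changes one rule is a one-line patch, e.g. `"<p>".chars().fold(State::Data, step)` walks through all three rules and ends back in `State::Data`.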

The future

Two upcoming enhancements are a high priority for Web compatibility in Servo:

Character encoding detection and conversion. This will build on the zero-copy
UTF-8 parsing mentioned above. Non-UTF-8 content (~15% of the Web) will have
"one-copy parsing" after a conversion to UTF-8. This keeps the parser itself
lean and mean.

document.write support. This API can insert arbitrary UTF-16 code units (which
might not even be valid Unicode) in the middle of the UTF-8 stream. To handle
this, we might switch to WTF-8. Along with document.write we'll start to do
speculative parsing.

It's likely that I'll work on one or both of these in the next quarter.
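
Both problems can be seen in miniature with nothing but the standard library. The sketch below is only an illustration, not the planned design: `to_utf8` and `utf16_to_string` are hypothetical helpers, and "fall back to ISO-8859-1 when the bytes are not valid UTF-8" is a stand-in assumption for real encoding detection. The first function borrows the input when it is already UTF-8 (zero copies) and allocates exactly once otherwise (one copy). The second shows why document.write is awkward: a lone surrogate is a legal UTF-16 code unit but cannot be stored in a Rust `String`.

```rust
use std::borrow::Cow;

/// Returns the input as UTF-8 text. Valid UTF-8 is borrowed as-is;
/// anything else is decoded as ISO-8859-1 into a fresh String.
/// (Assumption for illustration: real detection is far more involved.)
fn to_utf8(bytes: &[u8]) -> Cow<'_, str> {
    match std::str::from_utf8(bytes) {
        // Zero-copy path: the parser can work on the original buffer.
        Ok(s) => Cow::Borrowed(s),
        // One-copy path: each ISO-8859-1 byte maps to the code point
        // with the same numeric value.
        Err(_) => Cow::Owned(bytes.iter().map(|&b| b as char).collect()),
    }
}

/// Tries to convert UTF-16 code units (as a script might pass to
/// document.write) into a String. Returns None if the units contain an
/// unpaired surrogate, which has no UTF-8 representation.
fn utf16_to_string(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

Here `utf16_to_string(&[0xD800])` returns `None`, which is the case an encoding like WTF-8 is meant to accommodate.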

Servo may get SVG support in the near future, thanks to canvg. SVG nodes can be
embedded in HTML or loaded from an external XML file. To support the first
case, html5ever needs to implement WHATWG's rules for parsing foreign content
in HTML. To handle external SVG we could use a proper XML parser, or we could
extend html5ever to support "XML5", an error-tolerant XML syntax similar to
WHATWG HTML. Ygg01 made some progress towards implementing XML5. Servo would
most likely use it for XHTML as well.

Improved performance is always a goal. html5ever describes itself as
"high-performance" but does not have specific comparisons to other HTML
parsers. I'd like to fix that in the near future. Zero-copy parsing will be a
substantial improvement, once some performance issues in Rust get fixed. I'd
like to revisit SSE-accelerated parsing as well.

I'd also like to support html5ever on some stable Rust 1.x version, although it
probably won't happen for 1.0.0. The main obstacle here is procedural macros.
Erick Tryzelaar has done some great work recently with syntex, aster, and
quasi. Switching to this ecosystem will get us close to 1.x compatibility and
will clean up the macro code quite a bit. I'll be working with Erick to use
html5ever as an early validation of his approach.

The C API for html5ever still builds, thanks to continuous integration. But
it's not complete or well-tested. With the removal of Rust's runtime,
maintaining the C API does not restrict the kind of code we can write in other
parts of the parser. All we need now is to complete the C API and write tests.
This would be a great thing for a community member to work on. Then we can
write bindings for every language under the sun and bring fast, correct,
memory-safe HTML parsing to the masses :)