html5lib Parser

lxml can use the parsing capabilities of html5lib through the
lxml.html.html5parser module. It mirrors the interface of the
lxml.html module by providing fromstring(), parse(),
document_fromstring(), fragment_fromstring() and
fragments_fromstring(), which work like the regular HTML parsing
functions.

Differences to regular HTML parsing

The returned tree differs in a few ways from the one produced by the
regular HTML parsing functions in lxml.html. html5lib normalizes some
elements and element structures to a common format. For example, even
if a table does not have a tbody, html5lib will inject one automatically.
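The tbody injection can be observed with a short sketch along the following lines. It assumes html5lib is installed alongside lxml, and it compares local tag names only, because the parser may place elements in the XHTML namespace depending on how it is configured. The local_names helper is purely illustrative and not part of lxml.

```python
try:
    from lxml import etree
    from lxml.html import html5parser
except ImportError:  # lxml or html5lib is not installed
    etree = html5parser = None


def local_names(root):
    """Return the namespace-free tag names of root and its descendants."""
    return [etree.QName(el).localname for el in root.iter()
            if isinstance(el.tag, str)]  # skip comments and similar nodes


if html5parser is not None:
    # The markup contains neither <tbody> nor <tr>; html5lib is expected
    # to add both while building the tree.
    table = html5parser.fromstring("<table><td>foo")
    print(local_names(table))
```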

fragment_fromstring(string, create_parent=False, guess_charset=False):

Returns an HTML fragment from a string. The fragment must contain
just a single element, unless create_parent is given;
e.g., fragment_fromstring(string, create_parent='div') will
wrap the element in a <div>. If create_parent is true, the
default parent tag (div) is used.

If a bytestring is passed and guess_charset is true, the chardet
library (if installed) will guess the charset if ambiguities exist.
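As a rough illustration of create_parent, the sketch below parses a single element, wraps two sibling elements in an explicit parent, and relies on the default parent for mixed text and markup. It assumes html5lib is installed. The sample markup is invented for the example, and the use of etree.ParserError for the multiple-element case reflects my reading of the module rather than anything stated above.

```python
try:
    from lxml import etree
    from lxml.html import html5parser
except ImportError:  # lxml or html5lib is not installed
    etree = html5parser = None

if html5parser is not None:
    # A single element can be parsed without a parent.
    single = html5parser.fragment_fromstring("<p>one</p>")
    print(etree.QName(single).localname, single.text)

    # Two sibling elements need a wrapper; here it is named explicitly.
    wrapped = html5parser.fragment_fromstring(
        "<p>one</p><p>two</p>", create_parent="div")
    print(wrapped.tag, len(wrapped))

    # create_parent=True falls back to the default parent tag.
    defaulted = html5parser.fragment_fromstring(
        "text <b>bold</b>", create_parent=True)
    print(defaulted.tag, repr(defaulted.text))

    # Without a parent, more than one element is rejected
    # (assumed exception type: etree.ParserError).
    try:
        html5parser.fragment_fromstring("<p>one</p><p>two</p>")
    except etree.ParserError as exc:
        print("rejected:", exc)
```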

fragments_fromstring(string, no_leading_text=False, parser=None):

Returns a list of the elements found in the fragment. The first item in
the list may be a string. If no_leading_text is true, leading text is
treated as an error, and the result is always a list of elements only.

If a bytestring is passed and guess_charset is true, the chardet
library (if installed) will guess the charset if ambiguities exist.
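The handling of leading text might look like this in practice. The sketch assumes html5lib is installed. The describe helper is hypothetical, and catching etree.ParserError is an assumption about which exception signals the error.

```python
try:
    from lxml import etree
    from lxml.html import html5parser
except ImportError:  # lxml or html5lib is not installed
    etree = html5parser = None


def describe(items):
    """Label each returned item as leading text or by its local tag name."""
    return ["text" if isinstance(item, str) else etree.QName(item).localname
            for item in items]


if html5parser is not None:
    markup = "leading <b>bold</b><i>italic</i>"

    # Default behaviour: the leading text comes back as the first list item.
    print(describe(html5parser.fragments_fromstring(markup)))

    # With no_leading_text=True the same input is rejected.
    try:
        html5parser.fragments_fromstring(markup, no_leading_text=True)
    except etree.ParserError as exc:
        print("rejected:", exc)
```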

fromstring(string):

Returns the result of document_fromstring or fragment_fromstring,
depending on whether the string looks like a full document or just a
fragment.
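A small sketch of this behaviour, assuming html5lib is installed: the first input starts with a doctype and is treated as a document, while the second is a lone element. Local names are compared because the elements may live in the XHTML namespace; the sample markup is invented for illustration.

```python
try:
    from lxml import etree
    from lxml.html import html5parser
except ImportError:  # lxml or html5lib is not installed
    etree = html5parser = None

if html5parser is not None:
    # Looks like a full document, so the root <html> element is returned.
    document = html5parser.fromstring(
        "<!DOCTYPE html><html><head><title>t</title></head>"
        "<body><p>text</p></body></html>")
    print(etree.QName(document).localname)

    # Looks like a fragment, so the single element itself is returned.
    fragment = html5parser.fromstring("<p>text</p>")
    print(etree.QName(fragment).localname)
```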

Additionally, all parsing functions accept a parser keyword argument
that can be set to a custom parser instance. To create custom parsers,
you can subclass the HTMLParser and XHTMLParser classes from the same
module. Note that these are the parser classes provided by html5lib.
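One possible custom parser is sketched below, assuming html5lib is installed. PlainTagParser is a hypothetical name. It relies on namespaceHTMLElements, a constructor option of html5lib's HTMLParser, to keep tag names free of the XHTML namespace.

```python
try:
    from lxml.html import html5parser
except ImportError:  # lxml or html5lib is not installed
    html5parser = None

if html5parser is not None:
    class PlainTagParser(html5parser.HTMLParser):
        """Hypothetical subclass that produces tags without a namespace."""

        def __init__(self, **kwargs):
            # Option of html5lib's HTMLParser; callers can still override it.
            kwargs.setdefault("namespaceHTMLElements", False)
            super().__init__(**kwargs)

    # Hand the custom parser instance to any of the parsing functions.
    paragraph = html5parser.fromstring("<p>hello</p>", parser=PlainTagParser())
    print(paragraph.tag, paragraph.text)
```

Passing an instance rather than the class lets one configured parser be reused across calls.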