In my pet project I have to parse HTML pages from various sources and then extract meta-information from them (Open Graph, RDFa, Dublin Core, etc.). Currently I use closure-html together with XPath, but closure-html and cxml are very sensitive to malformed markup: even minor deviations cause errors.

What HTML parsers do you use? What would you advise for better extracting meta-information from HTML pages?
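For comparison with the Lisp options below, here is what this kind of extraction looks like with a lenient parser. This is just a sketch in Python using the stdlib's html.parser, which recovers from sloppy markup instead of signalling an error; the sample page and the `OpenGraphExtractor` name are mine, not from any library:

```python
from html.parser import HTMLParser

class OpenGraphExtractor(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs,
    tolerating unquoted attributes and unclosed tags."""
    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        prop = d.get("property", "")
        if prop.startswith("og:") and "content" in d:
            self.properties[prop] = d["content"]

# Deliberately sloppy markup: unquoted attribute values, no closing tags.
page = ('<html><head><meta property=og:title content="Hello world">'
        '<meta property="og:type" content=article><p>no closing tags')

parser = OpenGraphExtractor()
parser.feed(page)
print(parser.properties)
# {'og:title': 'Hello world', 'og:type': 'article'}
```

A strict XML toolchain like cxml would refuse this input outright; a tolerant parser just hands you whatever it could recognize.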

Personally I use closure-html, but I'm fortunate in that the HTML I'm reading is well-formed (or at least closure-html isn't choking on it). html5lib is an actively maintained Python parser for HTML5. I've been pondering whether you could run it directly with clpython or burgled-batteries, or even write a small Python script that prints the DOM in LHTML form and invoke it via SB-EXT:RUN-PROGRAM, but I'm not looking into it very urgently because, as I said above, chtml works for me so far.
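To make the bridge idea concrete, such a script might look like the following. This is only a sketch: I'm using the stdlib's html.parser rather than html5lib so there's nothing to install, the s-expression layout is made up (not real LHTML), and the naive version trusts the input to close its non-void elements (html5lib would actually repair the tree). The Lisp side would run it with SB-EXT:RUN-PROGRAM and READ the result:

```python
import io
import sys
from html.parser import HTMLParser

# Partial list of HTML5 void elements, which never get a closing tag.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}

def quote(s):
    """Escape a string so the Lisp reader can READ it."""
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'

class SexpPrinter(HTMLParser):
    """Serialize markup as a nested s-expression,
    e.g. (:p ((:class "x")) "hi" (:br ()))."""
    def __init__(self, out):
        super().__init__()
        self.out = out

    def handle_starttag(self, tag, attrs):
        pairs = " ".join("(:%s %s)" % (k, quote(v or "")) for k, v in attrs)
        self.out.write(" (:%s (%s)" % (tag, pairs))
        if tag in VOID:
            self.out.write(")")      # void elements close immediately

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.out.write(")")

    def handle_data(self, data):
        if data.strip():
            self.out.write(" " + quote(data))

def to_sexp(html):
    buf = io.StringIO()
    SexpPrinter(buf).feed(html)
    return buf.getvalue().strip()

if __name__ == "__main__":           # invoked from Lisp via SB-EXT:RUN-PROGRAM
    sys.stdout.write(to_sexp(sys.stdin.read()))
```

For example, `to_sexp('<p class="x">hi<br></p>')` produces `(:p ((:class "x")) "hi" (:br ()))`, which the Lisp reader can consume directly.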

HTML5 explicitly steps back from XML features. It's something I disagree with, but they decided that since 'nobody' was writing valid XML, there was no point in supporting it. Specifically:

boolean attributes do not take a value; what matters is just the presence or absence of the name,

void elements like BR, which can never have children, need no self-closing slash, and

elements that CAN have children but DON'T in your markup (DIV, etc.) MUST have an explicit closing tag (if you use XML's self-closing syntax, the slash is ignored and it gets treated as an opening tag)
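To see the first two divergences in action, here's a quick comparison (in Python, since that's the library under discussion) of a strict XML parser and the stdlib's lenient html.parser on the same snippet; the snippet is mine, not from the spec:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Legal HTML5: a valueless boolean attribute and an unclosed void <br>.
snippet = '<p><input type="checkbox" checked>one<br>two</p>'

# A strict XML parser rejects the snippet outright.
try:
    ET.fromstring(snippet)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# The tolerant HTML parser reports the boolean attribute as a name
# whose value is None.
class AttrSpy(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))

spy = AttrSpy()
spy.feed(snippet)

print(xml_ok)                    # False: not well-formed XML
print(dict(spy.seen)["input"])   # [('type', 'checkbox'), ('checked', None)]
```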

And all that ignores the now-properly-defined algorithm for parsing bad markup in general, which is why I linked the Python library even though accessing it via some form of Python bridge is quite awkward. Parsing HTML properly is evil, but as neslepaks said, there is now a CL port of it: https://github.com/copyleft/cl-html5-parser