I was having a hard time finding an HTML parser for my latest C++ project, so I decided to write up a quick summary of what I ended up using.

Revisited! Please see the new article here.

My #1 requirement for a parser was that it had to provide some mechanism of searching for elements. There are a couple of parsers available that only provide SAX-style parsing, which is very inconvenient for all but the simplest of parsing tasks. An ideal API would provide searching using XPath expressions, or something similar.

QWebElement – Part of the Qt framework. Although it provides a rich API, I couldn’t figure out how to compile any Qt code outside of Qt Creator (I’m using Code::Blocks.)

htmlcxx – Standalone, tiny library. I got some code up and running with this library very fast. However, I quickly realized how limited it is (e.g. poor attribute accessors, no way to search for elements.) Limited documentation.

Tidy – The classic HTML cleaner/repairer has a built-in SAX-style parser. Simple to use, but like htmlcxx, limited in what it can do.

Tidy + libxml++ – Tidy can transform HTML into XML, so all that’s needed is a good XML parser. This was the solution I ended up using.

My final solution was to use Tidy to clean up the markup and convert it into XML. Then, I use libxml++ (a C++ wrapper for libxml) to traverse the DOM. libxml++ supports searching for elements with XPath, so I was happy.

Step 2: Parse the XML with libxml++:
The following code parses the HTML contained in ‘response’ (passing it to CleanHTML first.) Then, we search for the element with id ‘some_id’. After outputting how many elements match that criterion (should be 1), we output the line in the XML at which the element occurs. For the sake of saving space I omit error checking.

To compile the example code, I use the g++ flags: `pkg-config --cflags glibmm-2.4 libxml++-2.6 --libs` -ltidy. As the flags suggest, you’ll need the glibmm library in addition to Tidy and libxml++ (and their dependencies.)

30 Responses to “Parsing HTML with C++”

Hi Mark – The code is pretty much usable as is. To convert this into a complete example, all you need to do is throw Step 2 into a ‘main’ method (excluding the #include line, of course), with Step 1 code prepended to the top of the file.

my concern is that there is both bold and non-bold text. Then, there is some content plus a child. How is this compliant with XML? And how does one recognize separately the part of the content which is outside the inner tag ( in this case) and that inside it?

Ah, I see. With XPath, to select the text itself, you need to use the text() function. This is because XPath treats text as an additional node. This is untested, but I think if you want to select the “Text plus ” text, you can use:

terminate called after throwing an instance of ‘xmlpp::validity_error’
what():
Validity error:
Line 289, column 7 (error):
Value “center” for attribute align of img is not among the enumerated set
Line 296, column 20 (error):
No declaration for attribute data-form-id of element a
Line 569, column 3 (error):
Element img does not carry attribute alt

Looking at my project from when I first used this code, I never had to specify a DTD. Starting with an instance of the DomParser (xmlpp::DomParser doc;), I simply write doc.parse_memory(CleanHTML(response));. As long as you are first passing your HTML to Tidy to clean it up, I wouldn’t think xmlpp would throw any errors.

Hello and thanks for your tutorial ! I’m having an issue with TidyHTML though… I’m using it as a library, as you do, and I’m having trouble “tidying” those new HTML5 tags. , , and the like…
How could I overcome this ? Should I add those tags, and if yes, how would I do that ? Be as precise as you can please, for I am not a smart man 😀

Just a follow-up: according to the API documentation, tidyOptSetValue(tidyDoc, TidyBlockTags, “header”) should do the trick… But it’s not. What am I missing ? Is that a whole lot more complicated and do I also have to create those in tidyenum.h and tags.c ?

Okay yeah, definitely had to check my shit before making these comments. Good thing there’s moderation heh? Anyway, I was dumb, all the “new tags” should be put in the same spot separated by commas. Sorry for the spam – maybe you could add this how-to-add-new-tags thing in the article itself ?

Hi! I have a question, if you still look after this thread. I’m kind of a “noob” at this kind of programming, so here’s my question. I have used libcurl to fetch an HTML file. It’s now in the root of the project I’m coding, and I don’t understand how to use your code to convert my (for example) “file.html” into XHTML that I can parse later. I don’t understand where, or which, file your program converts, and what kind of output you get. Do you put a URL somewhere in the program and get a string in return, or does an “.html” file get converted? That’s the part I’m stuck on. And where do I specify the file I want to be converted?
Hope you can help me, thanks!

The first function I define, CleanHTML, takes the contents of the HTML file itself. So, you just need to open the file and read its contents into a string. Then pass that to CleanHTML, which will produce XML. From there, just use the code in “Step 2” to actually parse the XML into a tree that you can traverse.
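Reading the file into a string needs nothing beyond the standard library. Here is a minimal sketch; `ReadFile` is a hypothetical helper name, and “file.html” is just the example name from the question.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the entire contents of the file at `path`.
std::string ReadFile(const std::string& path) {
    // Binary mode avoids any newline translation on Windows.
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("Could not open " + path);

    // Stream the whole file buffer into a string.
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}
```

The result can then be passed straight on, e.g. `CleanHTML(ReadFile("file.html"))`.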

I have one more question, but only for the sake of my curiosity. What would happen if you called CleanHTML(response) in the main method without the rest of Step 2 and left it as is? Would anything happen? If yes, when would the buffer clean itself?

By that I mean algorithmically looking into a web page to find the desired information and then “extracting” it. My program needs to batch-rename a list of files from an alphanumeric code to an actual “name”. I know a site where I can find the desired information just by typing the proper URL with the alphanumeric code. I then want to take the proper name into a string and use it to rename the file. I hope this answers your question.

tidyBufAppend( out, in, r ); is not SAX-style parsing; it just adds bytes to a buffer.
In fact, you can then see in main():
err = tidyParseBuffer(tdoc, &docbuf);
which parses the whole preloaded buffer.

Can you please upload the project and its dependencies, such as glibmm and the GTK includes, so that the whole MSVC project compiles properly?
This would probably save everyone a huge amount of time gathering the right version from the internet.