Parsing XML and HTML using Perl and LibXML

I used to parse HTML data with regular expressions, and XML documents with XML parsers that normally turned the whole document into arrays and hashes (key-value pairs).

But this time, I needed to retrieve only those nodes of an XML document that carried a specific attribute. I could have retrieved all the nodes, looped over every record, and filtered out the ones I was interested in with a few conditions, or I could use a cleverer, more modern solution. This is where LibXML stepped in.

XML::LibXML is probably the most widely used XML parsing library for Perl, and it is based on libxml2. Apart from its fast parsing, the most interesting thing about this library is that it supports XPath syntax. I had seen the term before but had never used it in practice. XPath lets you specify very easily which elements you want to retrieve.
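To make this concrete, here is a minimal sketch of selecting only the nodes that have a particular attribute value. The document, the element names (library, book, title) and the lang attribute are made up for illustration; only load_xml, findnodes and findvalue come from XML::LibXML itself.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, invented for this example.
my $xml = <<'XML';
<library>
  <book id="1" lang="en"><title>Learning Perl</title></book>
  <book id="2" lang="de"><title>Perl Kochbuch</title></book>
  <book id="3" lang="en"><title>Ruby Basics</title></book>
</library>
XML

# Parse the XML string into a DOM document.
my $doc = XML::LibXML->load_xml(string => $xml);

# The XPath predicate [@lang="en"] keeps only <book> elements whose
# lang attribute equals "en"; no manual loop-and-filter is needed.
my @books = $doc->findnodes('//book[@lang="en"]');

# findvalue evaluates an XPath expression relative to each node
# and returns the result as a string.
my @titles = map { $_->findvalue('title') } @books;

print "$_\n" for @titles;
```

With this sample data the script prints the titles of the two English books, in document order.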

An XPath query can even use built-in functions such as contains(where, what), which tests whether the string "where" contains the substring "what". The list of built-in functions is here.
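As a rough sketch of contains() in action, this time on HTML, the snippet below picks out the links whose href contains a given substring. The HTML fragment and its URLs are sample data I made up; load_html is the XML::LibXML entry point for HTML input, and the recover option is passed on the assumption that real-world pages are rarely well-formed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical HTML fragment, invented for this example.
my $html = <<'HTML';
<html>
  <body>
    <a href="https://www.perl.org/">Perl home</a>
    <a href="https://www.ruby-lang.org/">Ruby home</a>
    <a href="https://perldoc.perl.org/">Perl documentation</a>
  </body>
</html>
HTML

# load_html uses libxml2's HTML parser; recover => 1 asks it to
# carry on past markup errors instead of dying.
my $doc = XML::LibXML->load_html(string => $html, recover => 1);

# contains(@href, "perl") is true when the href attribute contains
# the substring "perl" (the comparison is case-sensitive).
my @links = $doc->findnodes('//a[contains(@href, "perl")]');

# Collect the matching URLs.
my @hrefs = map { $_->getAttribute('href') } @links;

print "$_\n" for @hrefs;
```

Here the query matches the two perl.org links and skips the Ruby one.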

It’s always time-consuming, and for some people unpleasant, to learn new things or new ways of doing something, but I find it interesting. In the end, it may help you do difficult tasks more quickly and easily. I hope this article has given you some inspiration.

You can use LibXML and XPath from many other scripting languages, and the usage is more or less the same. After a few months of working with both Perl and Ruby, though, I can highly recommend Ruby of the two. Perl is faster, yes, but Ruby is simply more comfortable and less error-prone. Using Ruby from the beginning can save you a lot of heartache. Here is LibXML for Ruby. If you are a complete beginner, look here.