Extracting from HTML with Mojo::DOM

Everyone wants to parse HTML, and many people reach for a regular expression to do that. Although you can use a regex to parse HTML, it’s not as fun as my latest favorite way: Mojo::DOM with CSS3 selectors. I find this much easier than trying to remember XPath, and I get to play with Mojo.

The DOM is the “Document Object Model”. Something behind the scenes parses and organizes the information and allows me to query it with questions such as “find all the a tags inside a div tag”, or “find all the tags of a particular class”. I don’t manipulate the text myself.
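To make those two questions concrete, here is a minimal sketch, assuming Mojolicious is installed. The markup and the class name dist are made up for illustration; only Mojo::DOM->new, find, and size come from Mojolicious itself.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up markup for illustration; the class name "dist" is an assumption.
my $dom = Mojo::DOM->new(<<'HTML');
<div><a class="dist" href="one/">One</a></div>
<p class="dist">Two</p>
<a href="three/">Three</a>
HTML

# "find all the a tags inside a div tag"
my $in_div = $dom->find('div a');

# "find all the tags of a particular class"
my $by_class = $dom->find('.dist');

print "a inside div: ", $in_div->size,   "\n";
print "class dist:   ", $by_class->size, "\n";
```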

If I’m using Mojo::UserAgent, I can get a DOM object from the response object from an HTTP request:
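A minimal sketch of that request, assuming Mojolicious is installed. The helper name dom_for_url is hypothetical and exists only to keep the example tidy; get, res, and dom are the Mojolicious methods doing the work.

```perl
use strict;
use warnings;
use Mojo::UserAgent;

# Hypothetical helper: fetch a URL and hand back the response body as a DOM.
sub dom_for_url {
    my ($url) = @_;
    my $ua = Mojo::UserAgent->new;
    return $ua
        ->get($url)    # Mojo::Transaction::HTTP
        ->res          # Mojo::Message::Response
        ->dom;         # Mojo::DOM
}

# Usage (needs network access, and the page may no longer be served as it was):
# my $dom = dom_for_url('http://search.cpan.org/~bdfoy/');
```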

The Mojo method-chaining style with one method per line shows its strengths as I get into more complicated tasks later.

I don’t have to make a request to get a DOM object. I’m often presented with HTML files to parse with no server to give them to me. Depending on the tractability of the task, I might hand edit it to remove the parts I don’t want to think about then use a regex to handle the rest. That way, I don’t have to do a lot of work to save state and know where I am in the document. With a DOM, that’s not a problem.

In the first example, I fetched http://search.cpan.org/~bdfoy/, my author page at CPAN Search. I’ll start with that HTML, assuming I already have it in a string.
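Building the DOM from a string might look like the sketch below. The $html here is a small stand-in, not the real page; the actual markup of the author page will differ.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Stand-in for the saved page; an assumption, not the real author page.
my $html = <<'HTML';
<html><body>
<a href="/">Home</a>
<a href="Business-ISBN-2.07/">Business-ISBN-2.07</a>
<a href="Tie-Cycle-1.17/">Tie-Cycle-1.17</a>
</body></html>
HTML

# No request needed: the constructor parses the string directly.
my $dom = Mojo::DOM->new($html);
```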

Once I have the $dom object, I can use find to select elements. I give find a CSS3 selector, in this case just a to find all the anchor links. find returns a Mojo::Collection object, a fancy way to store a list and do things to it. The Mojolicious style makes heavy use of method chaining so it needs a way to call methods on the result. In this example, I merely join the elements with a newline. These are the results:
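A sketch of that find-and-join step, run against stand-in markup rather than the real page, so the links it prints are illustrative only:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Stand-in markup; an assumption, not the real author page.
my $dom = Mojo::DOM->new(<<'HTML');
<a href="Business-ISBN-2.07/">Business-ISBN-2.07</a>
<a href="Tie-Cycle-1.17/">Tie-Cycle-1.17</a>
HTML

my $links = $dom
    ->find('a')       # Mojo::Collection of Mojo::DOM elements
    ->join("\n");     # each element stringifies to its own markup

print "$links\n";
```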

Each element in the collection is actually a Mojo::DOM object, so joining them prints each whole tag. To pull out just the link target, I use the collection’s map method. The first argument to map is the method to call on each element and the remaining arguments pass through to that method. In this case, I’m calling attr('href') on each object. Now I mostly have the values I want:
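A sketch of the map call, again on stand-in markup. The extra Home link is there to show why the result is only mostly what I want.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Stand-in markup; an assumption, not the real author page.
my $dom = Mojo::DOM->new(<<'HTML');
<a href="/">Home</a>
<a href="Business-ISBN-2.07/">Business-ISBN-2.07</a>
<a href="Tie-Cycle-1.17/">Tie-Cycle-1.17</a>
HTML

my $hrefs = $dom
    ->find('a[href]')           # only anchors that carry an href
    ->map( attr => 'href' );    # calls $element->attr('href') on each one

print $hrefs->join("\n"), "\n";
```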

I can get even fancier. Instead of the distribution name with the version, I can break it up with CPAN::DistnameInfo. I’ll turn every found link into a tuple of name and version. Since that module wants to deal with a distribution filename, I tack on .tar.gz to make it work out:
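One way this could look, assuming CPAN::DistnameInfo is installed alongside Mojolicious. The stand-in markup, the grep that keeps only links starting with a capital letter, and the trailing-slash cleanup are all assumptions about what the page contains, not details from the real author page.

```perl
use strict;
use warnings;
use Mojo::DOM;
use CPAN::DistnameInfo;

# Stand-in markup; an assumption, not the real author page.
my $html = <<'HTML';
<a href="/">Home</a>
<a href="Business-ISBN-2.07/">Business-ISBN-2.07</a>
<a href="Tie-Cycle-1.17/">Tie-Cycle-1.17</a>
HTML

my $tuples = Mojo::DOM->new($html)
    ->find('a[href]')
    ->map( attr => 'href' )
    ->grep( qr/\A[A-Z]/ )                   # assumed filter: distribution-like links
    ->map( sub {
        ( my $name = $_ ) =~ s{/\z}{};      # drop the trailing slash
        my $info = CPAN::DistnameInfo->new("$name.tar.gz");
        [ $info->dist, $info->version ];    # tuple of name and version
    } );

$tuples->each( sub { print join( "\t", @$_ ), "\n" } );
```

Passing a code reference to map, instead of a method name, is what lets each link become an array reference here.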
