7.3. XML::DOM

Enno Derksen's XML::DOM module is a good place to start exploring DOM
in Perl.
It's a complete implementation of Level 1 DOM with a
few extra features thrown in for convenience.
XML::DOM::Parser extends
XML::Parser to build a document tree installed in
an XML::DOM::Document object whose reference it
returns. This reference gives you complete access to the tree. The
rest, we happily report, works pretty much as you'd
expect.

Here's a program that uses DOM to process
an XHTML file. It looks inside
<p> elements for the word
"monkeys," replacing every instance
with a link to monkeystuff.com. Sure, you could do
it with a regular expression substitution, but this example is
valuable because it shows how to search for and create new nodes, and
read and change values, all in the unique DOM style.

The first part of the program creates a parser object and gives it a
file to parse with the call to parsefile( ).

This method returns a reference to an
XML::DOM::Document object, which is our gateway to
the nodes inside. We pass this reference along to a routine called
add_links( ), which will do all the processing we
require. Finally, we output the tree with a call to
toString( ), and then dispose of the object. This
last step performs necessary cleanup in case any circular references
between nodes could result in a memory leak.
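
As a minimal sketch of this first part, the driver might look like the
following. The routine name process_file( ) and the no-op add_links( )
stub are assumptions made for illustration; the sketch also returns the
serialized string rather than printing it inside the routine, which is a
design choice of this example and not something the text prescribes.

```perl
use strict;
use warnings;
use XML::DOM;

# Placeholder for the routine described in the text; a no-op stub here.
sub add_links { my ($doc) = @_; return; }

# Hypothetical driver: parse a file, modify the tree, serialize, clean up.
sub process_file {
    my ($infile) = @_;
    my $parser = XML::DOM::Parser->new;           # create a parser object
    my $doc    = $parser->parsefile($infile);     # returns an XML::DOM::Document
    add_links($doc);                              # perform our changes
    my $out = $doc->toString;                     # serialize the tree
    $doc->dispose;                                # break circular references
    return $out;
}

print process_file($ARGV[0]) if @ARGV;
```

The string is captured before dispose( ) is called because the document
should not be used once it has been disposed of.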

The add_links( ) routine starts with a call to the document object's
getElementsByTagName( ) method. It returns an XML::DOM::NodeList
object containing all matching <p> elements in the document
(multilevel searching is so convenient), from which we can select
nodes by index using item( ).
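
The node list lookup can be sketched on a small inline document. The
sample markup and variable names below are assumptions made for
illustration; the second <p> is nested inside a <div> to show that the
search is not limited to direct children.

```perl
use strict;
use warnings;
use XML::DOM;

my $doc = XML::DOM::Parser->new->parse(
    '<html><body><p>one</p><div><p>two</p></div></body></html>'
);

# In scalar context we get an XML::DOM::NodeList back.
my $paras = $doc->getElementsByTagName('p');
my $count = $paras->getLength;

for my $i (0 .. $count - 1) {
    my $para = $paras->item($i);                  # select a node by index
    print $i, ': ', $para->toString, "\n";
}

$doc->dispose;
```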

The bit we're interested in will be hiding inside a
text node inside the <p> element, so we have
to iterate over the children to find text nodes and process them. The
call to getChildNodes( ) gives us the child nodes, either in
a generic Perl list (when called in a list
context) or in another XML::DOM::NodeList object; for
variety's sake, we've selected the
first option. For each node, we test its type with a call to
getNodeType( ) and compare the result to
XML::DOM's constant for text
nodes, provided by TEXT_NODE( ). Nodes
that pass the test are sent off to a routine for some node massaging.
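
A short sketch of that filtering step follows, assuming a one-paragraph
sample document of our own. The node type constants are exported by
XML::DOM and are numeric, so the comparison uses ==.

```perl
use strict;
use warnings;
use XML::DOM;

my $doc  = XML::DOM::Parser->new->parse('<p>Monkeys are <b>cute</b>. Really.</p>');
my $para = $doc->getDocumentElement;

# List context: getChildNodes( ) hands back a plain Perl list.
my @children = $para->getChildNodes;

# Keep only the children whose node type is the text-node constant.
my @text_nodes = grep { $_->getNodeType == TEXT_NODE } @children;

print scalar(@children), " children, ", scalar(@text_nodes), " text nodes\n";
```

The <b> element is skipped here, so the word inside it would not be
processed by a routine that only looks at the paragraph's own text
children.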

The last part of the program targets text nodes and splits them
around the word "monkeys" to create a link.

First, the routine grabs the node's text value by
calling its getNodeValue( ) method. DOM
specifies redundant accessor methods used to get and set values or
names, either through the generic
Node class or through the more specific
class's methods. Instead of getNodeValue( ),
we could have used getData( ), which is specific to the text node
class. For some
nodes, such as elements, there is no defined value, so the generic
getNodeValue( ) method would return an undefined
value.
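
The two accessors can be compared side by side in a small sketch; the
sample document and variable names are assumptions made for
illustration.

```perl
use strict;
use warnings;
use XML::DOM;

my $doc  = XML::DOM::Parser->new->parse('<p>Monkeys are cute.</p>');
my $para = $doc->getDocumentElement;
my $text = $para->getFirstChild;                  # the text node inside <p>

my $via_node = $text->getNodeValue;               # generic Node accessor
my $via_text = $text->getData;                    # text-specific accessor
print "same value\n" if $via_node eq $via_text;

# An element has no value of its own, so the generic accessor yields undef.
my $elem_value = $para->getNodeValue;
print "element has no node value\n" unless defined $elem_value;
```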

Next, we slice the node in two. We do this by creating a new text
node and inserting it before the existing one. After we set the text
values of each node, the first will contain everything before the
word "monkeys", and the other will
have everything after the word. Note the use of the
XML::DOM::Document object as a factory to create
the new text node. This DOM feature takes care of many administrative
tasks behind the scenes, making the genesis of new nodes painless.
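
The split on its own can be sketched as below. The sample sentence is
ours, and the capturing regular expression is just one way to obtain
the text before and after the match.

```perl
use strict;
use warnings;
use XML::DOM;

my $doc  = XML::DOM::Parser->new->parse('<p>I like monkeys a lot.</p>');
my $para = $doc->getDocumentElement;
my $node = $para->getFirstChild;                  # the original text node

if ($node->getNodeValue =~ /\A(.*?)(monkeys)(.*)\z/is) {
    my ($pre, $post) = ($1, $3);

    # The document object acts as the factory for the new node.
    my $before = $doc->createTextNode($pre);
    $para->insertBefore($before, $node);          # new node goes first
    $node->setNodeValue($post);                   # old node keeps the rest
}

my @pieces = map { $_->getNodeValue } $para->getChildNodes;
print join('|', @pieces), "\n";
```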

After that step, we create an <a> element
and insert it between the text nodes. Like all good links, it needs a
place to put the URL, so we set it up with an href
attribute. To have something to click on, the link needs text, so we
create a text node with the word
"monkeys" and append it to the
element's child list. Then the routine will recurse
on the text node after the link in case there are more instances of
"monkeys" to process.

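Putting the steps together, the whole routine might be sketched as
follows. The name fix_text( ) and the one-line test document are
assumptions made for illustration; the method calls are the ones named
in the text, plus getOwnerDocument( ) and getParentNode( ) to reach the
factory and the parent.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical routine: wrap each "monkeys" in a text node with a link.
sub fix_text {
    my ($node) = @_;
    my $text = $node->getNodeValue;
    return unless $text =~ /\A(.*?)(monkeys)(.*)\z/is;
    my ($pre, $word, $post) = ($1, $2, $3);

    my $doc    = $node->getOwnerDocument;
    my $parent = $node->getParentNode;

    # Split the text node in two around the matched word.
    my $before = $doc->createTextNode($pre);
    $parent->insertBefore($before, $node);
    $node->setNodeValue($post);

    # Build the <a> element and insert it between the two text nodes.
    my $link = $doc->createElement('a');
    $link->setAttribute('href', 'http://www.monkeystuff.com/');
    $link->appendChild($doc->createTextNode($word));
    $parent->insertBefore($link, $node);

    # Recurse on the remainder in case the word appears again.
    fix_text($node);
}

my $doc = XML::DOM::Parser->new->parse('<p>Monkeys like other monkeys.</p>');
fix_text($doc->getDocumentElement->getFirstChild);
my $html = $doc->getDocumentElement->toString;
print $html, "\n";
$doc->dispose;
```

The recursion ends once the remaining text no longer contains the word,
because each pass shortens the text left in the original node.
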
Does it work? Running the program on this file:

<html>
<head><title>Why I like Monkeys</title></head>
<body><h1>Why I like Monkeys</h1>
<h2>Monkeys are Cute</h2>
<p>Monkeys are <b>cute</b>. They are like small, hyper versions of
ourselves. They can make funny facial expressions and stick out their
tongues.</p>
</body>
</html>

produces this output:

<html>
<head><title>Why I like Monkeys</title></head>
<body><h1>Why I like Monkeys</h1>
<h2>Monkeys are Cute</h2>
<p><a href="http://www.monkeystuff.com/">Monkeys</a>
are <b>cute</b>. They are like small, hyper versions of
ourselves. They can make funny facial expressions and stick out their
tongues.</p>
</body>
</html>