A crash course on trees

There are many theoretical things to say about trees.
You are not going to hear them from me though.
I will spend some words to explain the basics, from a
pure utilitarian point of view.
For a more complete view on trees, see Sean Burke's
article included in HTML::Tree, available on CPAN.

Tree types

When we talk about trees, we often mix two concepts
both defined by the same word. A search tree, or
binary tree is a structure having two links per node, and
it is used to sort items, store them and retrieve them
efficiently. Then we have B-Trees, having more than
two links per node, commonly used to store information
(usually indexes) bigger than the available memory. They
use an algorithm to retrieve items from a disk and put them
into the main memory on demand. Both binary trees and B-trees
share the concept of a calculated insertion. A new node
is inserted in the appropriate part of the tree depending
on its value. These trees are indexing black boxes that
accept a new value and store it following some pre-determined
criteria and retrieve it quickly when required.
But we also have trees that don't insert nodes by their
internal engine rules, and allow you to create complex
structures to suit your needs. Trees of this sort don't
insert nodes in the most appropriate spot, don't keep
your nodes sorted, actually they don't know anything about
sorting nodes. However, they keep your nodes ordered.
What's the difference between sorted and ordered? Sorted
means that, given a rule, each node will end up in a
determined logical position, every time we insert it
into the tree. For example, given the rule "alphabetical
sort", a node with key "c" will be placed between two
nodes with keys "b" and "d". The physical position may
change, but the logical result will always be the same.
Ordered, on the other hand, could be sorted, but not
necessarily. We insert the nodes in the order we want,
and the tree structure will keep that order, rather than
trying to organize our nodes according to a given rule.
To understand the implication of what I am saying, let's
compare the features offered by some different data
structures.

Using a sorted array, we can retrieve elements quite
fast, by means of a binary search. However, inserting a new
element in a large sorted array is an expensive operation.
Using a hash, both inserting and retrieving are quite fast.
But the elements can't be kept ordered.
A linked list keeps a given order and inserts faster than
arrays, but searching is sequential, and can be as slow as
looking to all the elements.
A binary tree keeps the given order, allows fast insertions
and fast retrieving.
A DAG 1 tree, finally, keeps the given order, allows for fast
insertion, even though it doesn't have any mechanism for
fast retrieval. Why? What's the difference? The fast retrieval
in binary trees is due to the insertion mechanism, which inserts
the nodes according to sorting rules. Therefore, an inverse
rule allows the quick retrieval. DAG trees, instead, give you
freedom of insertion and you pay such freedom with less speed
in searching.

1 DAG stands for Directed Acyclic Graph, but I stated before that I wasn't going to tell anything theoretical here. So I am afraid that you should resort to the recommended reading.

Tree nodes

A tree is a funny thing, bearing semantics from comparison
to different pieces of reality. The name "tree" calls for a
botanical symbology, which breaks as soon as we realize that
Information Technology trees have their roots at the top.
However, following the same idea, we can talk about branches
and leaves to describe the forking of sub elements in the
tree.
Often, tree elements are described as having a parent (the
node above them) and children (the nodes below them).
In Tree::DAG_Node, the parent is called mother, and the
nodes below are called appropriately daughters.
Whatever we call them, tree elements have a link to two or
more nodes below them and a link to another node above them.
Such link could be unidirectional (from parent to child only)
or bidirectional (so the child can point to its parent.)
The lack of bi-directional link may be seen as a terrible
design blunder, since we can go from the top node down to
the lowest one, but we can't go back. Everything will become
clear when we remember that trees are recursive structures
and we access them exploiting this quality of theirs.

Traversing a tree

The most common way of accessing a tree is by recursively
accessing the subnodes of the starting node. For efficiency
sake, there are methods to avoid recursion, by means of more
fortified data structures, but the principle remains.
When we access all the nodes of a tree, we say that we
traverse a tree. Depending on what we do while traversing,
our walk can be called in-order, pre-order or post-order
traversal. The understanding of these concepts is crucial to
be able to do anything meningful to a tree.
In-order traversal is when we access the nodes in sort order,
and this usually applies to binary trees and B-trees.
The pre-order traversal means accessing a node, doing something
and then accessing the nodes below.
Post-order is when we access a node, recurse to the subnodes
and then do something.

pre-order (action: print the node name)
company, sales, net, store, R&D, research, development, pers.
Useful to print the company organigram, from top to bottom. Had
we udes post-order, we would have printed it upside-down.
post-order (action: sum the number of employees)
net, store, sales, research, development, R&D, pers, company.
Since the action is post-order, we can sum the employees in
each node to the ones in the parent. Had we done this in
pre-order, we would have given to "company" the employees
from "sales" before receiving the quantity from "net" and
"store" (i.e. when it was still zero).

What is a Tree::DAG_Node?

Tree::DAG_Node is a class to build generic trees.
This is opposed to some specialized trees, such as HTML::trees
or XML::Trees.
Strictly speaking, as the name suggests, DAG_Node is not a tree,
but a node. However, since trees are recursive structures, each
node contains a subtree. The DAG_Node class has all the methods
to deal with individual nodes or with the whole tree.
As my first programming book said, and the module author wants to
stress, a program is made of algorithms plus data structures
(N. Wirth, 1978). When these two elements are combined together
in an object, the result is a powerful engine.
Coming back to our DAG_Node, it is a quite rich module, having
a large collection of methods to manipulate both nodes and trees.
There are no specialized methods related to the node data, such
as the ones of HTML::Element. This lack of specialization doesn't
prevent DAG_Node from being a powerful tool even without any
addition.

Using Tree::DAG_Node

Using the base class directly

Here we create a node, the root of our tree.
The costructor doesn't need any parameter.
You can pass some options through a hash reference
although it doesn't currently do anything. It is there
for the descending classes benefit. After we create a
node, we can assign a name. No need to do it, although
for this example we are using names as the only visible
attribute of each node.

Create a new node, with the same procedure, and then
construct some sub nodes (daughters). The new_daughter
method is a constructor that creates an object of the
same class as the caller and adds it to the list of its
subnodes. Since the result of new_daughter() is an object,
we can apply the name() method to it.
After the above statements, $tutorials is a new tree.

This method prints a vertical representation of the tree,
improving readability for trees with long lists of daughters.
Check the resulting output. The whole representation of
PerlMonks structure wouldn't fit in this page.

Traversing a tree

The main method to traverse a tree is walk_down(\%options).
This method touches all the nodes in the tree, starting at
the calling node, and calls a sub for each one. The sub is
passed as a value in the \%options parameter. Depending on
the key name, it will be called before ('callback' =
pre-order) or after ('callbackback' = post-order) accessing
the sub-nodes.
You should provide at least one of them. Anything else you
define in the %options hash is passed to the sub as second
parameter. Especially useful is the "_depth" parameter, which
states the starting depth for the tree and it is incremented
at each level.

In this call, we define an anonymous sub to print each node,
with special treatment for the root. The sub is called for each
node, with the node as first parameter and an anonymous hash
containing '_depth' and 'treename' as second parameter.

One important caveat. The callback sub will be called as
long as it returns a true value. A false value will stop
the recursion. Useful when we are searching for something,
and we want to avoid unnecessary walking down after the
result has been achieved. Remember that a print statement
returns always true. If you want to stop a printing walk
down, you must return "0", an empty string or undef.
But we are not limited to using the walk_down() method.
Writing our own traverse routine is quite easy.

See Search for nodes below for an explanation of the
address() method.
If we want a post-order traversal in the above routine,
then it's enough to move the print statement after the
recursive call to traverse.

Node attributes

What we have seen so far is not extremely useful. Playing
with the node name might be thrilling as long as we limit
our programming efforts to drawing ascii trees and counting
nodes. But in the real world we need to assign more content
to our nodes. To achieve this goal, we could either inherit
the base class, or use the attributes() method, which gives
us access to a node field, where we can store anything we
need. By default, the internal attributes field is set to an
empty anonymous hash, but we can set it up to pretty much
anything we like, as long as it's a reference.

Of course, the only way to access these attributes is through
the attributes() method itself. If we want to use an Object
Oriented interface, we need to inherit from Tree::DAG_Node,
or to store an object into attributes().

Walking around the tree

Traversing a tree is one of the most common tasks, but not
the most flexible one, when we need to access specific nodes
relative to others.
Tree::DAG_Node offers a large variety of node walking methods,
to go from one node to its neighbours.

Down among the daughters, we can identify them by their position
relative to a given node.
Every node can access its parent by calling $node->mother.
The complete list of the node ancestors is at our disposal
by using the appropriate method. If we access the node named "7",
for example, its ancestors will be the nodes named 'z', '2','a','root'.
If we want to do something to all the nodes in a subtree, without
bothering about using walk_down(), we can use the descendants()
method, which returns a list of all the nodes below the calling
one. If we want to include the caller, then self_and_descendants()
will give us the entire subtree.
Let's say that $node1 represents the node named '1'.

Check the reference documentation for more methods to move from
node to node.

Inheriting the base class

To inherit from Tree::DAG_Node, we must remember that the
class constructor accepts an optional parameter, an
anonymous hash. This parameter is not used by the base
class constructor, but it's there for the convenience
of the descending classes. It means that if we pass
parameters to the new() constructor through this
anonymous hash, we can use the other constructors without
modification. The new_daughter() and new_daughter_left
constructors will call new(), passing the anonymous hash
that they receive as an argument. The bottom line is, if
we can live with the idea of passing parameters through a
hash reference, we only need to modify one constructor,
otherwise we must rewrite all three of them.
In our example, we agree to the convention and we
override the main constructor only.

This new() constructor simply passes the $options hash
reference to the internal attributes(). Actually, nothing
prevents us from using a different reference, say an object
instead of a hash, but it will do for now. It's worth noting
that the author of the base class marks this construction as
a bad thing. While I agree in principle, my Impatience
took me to embracing this shortcut.
Next, we add three more methods, two being a simple
interface to the object's internal data, while the third
one adds some general functionality to the base class.

The by_name() method walks down the tree starting at the
calling node, and returns a list of all the nodes matching
the given name. In scalar context, only the first matching
node is returned. If the name was not found, returns undef.

Finally, we prepare three mode methods that work
specifically with this particular class, using the tree
for calculations and showing results.

clear_totals() sets to 0 the values for employees and
budget in each node that has subnodes, thus preparing the
way for the next method. sum_up() gets the total values for
employees and budget from each subnode, recursively, using
a post-order traversal. The total value pops up to the root
by virtue of this engine. Eventually, print_wealth() shows
the amount of employees and budget for each node and for
the whole company.
Now, let's see an application of this new package.

The new company is created using the derived class. Then,
the new_daughter() constructor is used, with an anonymous
hash as a parameter, that gets passed to the main constructor.
Soon we have three departments, two of which have a few sub
departments. We initialize to 0 the number of employees in
the non-terminal nodes, to prepare for the summing up.
The sum_up() method performs a post-order traversal, to
collect the totals. Finally, a pre-order traversal with
print_wealth() gives us a nice printout of the company
strength.

Searching for nodes

In one of the previous examples, we met the address()
method. This function returns a string describing the
position of a given node within the tree. "0" is the root.
"0:0" is the first daughter, "0:1" the second and so on.
This address is not a static attribute, but it is calculated
on demand. One interesting feature of this method is that
if we pass a valid address, it will return the
corresponding node.
Thus, in the latest example, 'company' has address '0',
'sales' has '0:0', 'R&D' has '0:2' and 'dev' has '0:2:1'.
The address for a given node is always the parent's
address plus the position of that node within the parent's
daughters' list.
If we can figure out the address of a node, we can get the
relative node without traversing the tree.

However, calling nodes by their address is not always
convenient, since humans handle names more easily than
numbers, and since the address can change if we insert
nodes before the one we are looking for.

In the previous section we have seen a sub to retrieve
a node by name. It works fine as long as our name is
unique. If not, we should build a more selective method
that will find exactly what we need. Let's assume that
all our nodes have attributes() set as a hash ref, and
each one has a key holding a unique identifier.
To search for this unique ID is not so hard.

This instruction will return the node with attributes
containing { ID => 'nutcracker'}, or undef if not found.
For more complex conditions, when your nodes contain
objects, remember to check the type of attributes (using
ref), to avoid run-time exceptions.

Modifying the tree

One of the advantages of trees over less flexible data
structures is their ability of moving subtrees around
with minumum effort. To show how Tree::DAG_Node achieves
this goal, we'll see some of the moving routines available.
Let's build a sample tree, first.

Our first task is to remove the node named '2' and assign
its daughters to the root. We identify the node by its
address, and then use the method replace_with_daughters().
The effect of this method is to unlink the node from its
mother and to add all its daughter in its place, moving
to the right all existing right nodes.

Notice that the address of the node to move, that was
'0:2' in the original tree, is now '0:4', due to the
insertion of new nodes on its left.
Calling add_daughter() with $node as an argument causes
the link $node->mother to be cut, and a new one to be
created in its stead. If we just want to get rid of that
subtree, the method $node->unlink_from_mother will do.
We should just take care of the memory still occupied
by this subtree, and call $node->delete_tree.

Conclusion

So many things, so little space (and time!)
I could go on for several pages, without exhausting the
subject1, which is indeed rich and fascinating.
The purpose of this tutorial, however, is only to give
you a general view of the module. Once you are confident
with its main issues, you will find the reference manual
easier to read.
One last recommendation, before I leave you. Trees are
memory consuming, and unlike other data structures, they
are not cleaned up by Perl's garbage collector (GC).
This is due to the many references held in each node, which
prevent the GC from sweeping away your tree once it goes
out of scope. Therefore, if you don't need the tree any
longer, call the delete_tree() method, to let the GC do
its job.

1 But I might have exhausted the reader!

General information - Recommended reading

The module documentation
In the module itself. Mandatory, I would say.
There are many juicy methods waiting for you to find them.
This tutorial was only a glimpse. I skipped many details
that you would like to explore.
Pay particular attention to the "See also" section, where
you'll find valuable recommendations.

Mastering Algorithms with Perl
by Jon Orwant, Jarkko Hietanieni, John MacDonald
(www.oreilly.com)
For the theory behind the babbling in this article, MAwP
is a must.

Object Oriented Perl
by Damian Conway (www.manning.com/conway).
Is the most authoritative work on Perl object oriented
methodology. Tree theory and implementation
is discussed passim with interesting examples.

update
Amended the post-order traversal, that I reported from the wrong example. Thanks ChemBoy.

Excellent node! But a minor quibble: my memories from my intro data structures course are not the strongest, but they suggest that your definition of a post-order traversal is slightly flawed: the Left-Right order of the leaf nodes should be preserved (that is, net should still precede store, and research precede development).

Not a major issue, of course, for an org chart--but it could be unfortunate if that happened to an expression tree. :-) (thanks to maverick for the link, and for confirming that I'm not totally nuts.)

If God had meant us to fly, he would *never* have given us the railroads.
--Michael Flanders

You are right, of course. Thanks for pointing it out.
I absentmindedly copied the sequence from the result of an early script where the nodes were in that order. Using the same structure that we have in the company example, the post-order traversal is correct. Just change "callback" into "callbackback" and the printout will be: