Why use parse trees

Parse trees are an in-memory representation of the input with a structure
that conforms to the grammar.

The advantages of using parse trees instead of semantic actions:

You can make multiple passes over the data without having to re-parse the
input.

You can perform transformations on the tree.

You can evaluate things in any order you want, whereas with attribute schemes
you have to process the input from beginning to end.

You do not have to worry about backtracking and action side effects that
may occur with an ambiguous grammar.

Example

Now that you think you may want to use trees, I'll give an example of how
to use them and you can see how easy they are to use. So, following with tradition
(and the rest of the documentation) we'll do a calculator. Here's the grammar:

integer
    =   lexeme_d[ token_node_d[ (!ch_p('-') >> +digit_p) ] ]
    ;

factor
    =   integer
    |   '(' >> expression >> ')'
    |   ('-' >> factor)
    ;

term
    =   factor
        >> *(   ('*' >> factor)
            |   ('/' >> factor)
            )
    ;

expression
    =   term
        >> *(   ('+' >> term)
            |   ('-' >> term)
            )
    ;

Now, you'll notice the only thing different in this grammar is the token_node_d
directive. This causes the integer rule to group all the input into one node.
Without token_node_d, each character would get its own node. As you'll
soon see, it's easier to convert the input into an int when all the characters
are in one node. Here is how the parse is done to create a tree:

tree_parse_info<> info = pt_parse(first, expression);

pt_parse() is similar to parse(). There are four overloads:
two for pairs of first and last iterators and two for character strings. Two
of the functions take a skipper parser and the other two do not.

The tree_parse_info struct contains the same information as a parse_info
struct as well as one extra data member called trees. When the parse finishes,
trees will contain the parse tree.

long eval_term(iter_t const& i)
{
    // ... see parse_tree_calc1.cpp for complete example
    // (it's rather similar to eval_expression() ) ...
}

long eval_factor(iter_t const& i)
{
    // ... again, see parse_tree_calc1.cpp if you want all the details ...
}

long eval_integer(iter_t const& i)
{
    // use the range constructor for a string
    string integer(i->value.begin(), i->value.end());
    // convert the string to an integer
    return strtol(integer.c_str(), 0, 10);
}

The full source code is in parse_tree_calc1.cpp, which is part of the Spirit distribution.

So, you like what you see, but maybe you think that the parse tree is too
hard to process? With a few more directives you can generate an abstract syntax
tree (ast) and cut the amount of evaluation code by at least 50%. So
without any delay, here's the ast calculator grammar:

The differences from the parse tree grammar are the added directives.
The inner_node_d directive causes the first and last nodes generated
by the enclosed parser to be discarded, since we don't really care about the
parentheses when evaluating the expression. The root_node_d directive
is the key to ast generation. A node that is generated by the parser inside
of root_node_d is marked as a root node. When a root node is created,
it becomes a root or parent node of the other nodes generated by the same rule.

To start the parse and generate the ast, you must use the ast_parse functions,
which are similar to the pt_parse functions.

tree_parse_info<> info = ast_parse(first, expression);

Here is the eval_expression function (note that to process the ast we only
need one function instead of four):
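To give a feel for what a single evaluation function can look like, here is a minimal sketch. It does not use Spirit's own types: ast_node, rule_id, and the helpers leaf() and binary() are hypothetical stand-ins invented for this illustration. The tree is assumed to have the shape an ast would have, with operators as parents and operands as children.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical ids, one per rule of the calculator grammar.
enum rule_id { integer_id, factor_id, term_id, expression_id };

// Hypothetical stand-in for an ast node (not Spirit's tree_node).
struct ast_node
{
    rule_id id;                      // which rule produced the node
    std::string text;                // matched text: digits or an operator
    std::vector<ast_node> children;  // operands, when the node is an operator
};

// One recursive function handles every kind of node.
long eval_expression(ast_node const& n)
{
    switch (n.id)
    {
    case integer_id:
        assert(n.children.empty());
        return std::strtol(n.text.c_str(), 0, 10);

    case factor_id:                  // unary minus: one operand
        assert(n.children.size() == 1);
        return -eval_expression(n.children[0]);

    case term_id:                    // '*' or '/': two operands
    {
        assert(n.children.size() == 2);
        long lhs = eval_expression(n.children[0]);
        long rhs = eval_expression(n.children[1]);
        return n.text == "*" ? lhs * rhs : lhs / rhs;
    }

    case expression_id:              // '+' or '-': two operands
    {
        assert(n.children.size() == 2);
        long lhs = eval_expression(n.children[0]);
        long rhs = eval_expression(n.children[1]);
        return n.text == "+" ? lhs + rhs : lhs - rhs;
    }
    }
    return 0; // not reached for a well-formed tree
}

// Helpers for building sample trees by hand.
ast_node leaf(std::string const& digits)
{
    ast_node n = { integer_id, digits, std::vector<ast_node>() };
    return n;
}

ast_node binary(rule_id id, std::string const& op,
                ast_node const& lhs, ast_node const& rhs)
{
    ast_node n = { id, op, std::vector<ast_node>() };
    n.children.push_back(lhs);
    n.children.push_back(rhs);
    return n;
}
```

For example, binary(expression_id, "+", leaf("2"), binary(term_id, "*", leaf("3"), leaf("4"))) models the ast for 2+3*4, and eval_expression returns 14 for it.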

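tree_parse_info

The tree_parse_info struct returned by pt_parse and ast_parse has the
following members:

stop

Points to the final parse position (i.e. parsing processed
the input up to this point).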

match

true if parsing is successful. This may be full (the
parser consumed all the input), or partial (the parser consumed only a portion
of the input.)

full

true when we have a full match (when the parser consumed
all the input).

length

The number of characters consumed by the parser. This
is valid only if we have a successful match (either partial or full).

trees

Contains the root node(s) of the tree.

tree_match

When Spirit is generating a tree, the parser's parse() member function will
return a tree_match object, instead of a match object. tree_match has three
template parameters. The first is the Iterator type which defaults to char
const*. The second is the node factory, which defaults to node_val_data_factory.
The third is the attribute type stored in the match. A tree_match has a member
variable which is a container (a std::vector) of tree_node
objects named trees. For efficiency reasons, when a tree_match is copied, the
trees are not copied; they are moved to the new object, and the source
object is left with an empty tree container. tree_match supports the same interface
as the match class: it has an operator bool() so you can test it for a successful
match: if (matched), and you can query the match length via the length() function.
The class has this interface:

When a parse has successfully completed, the trees data member will contain
the root node of the tree.

vector?

You may wonder, why is it a vector then? The answer is that it is partly
for implementation purposes, and also if you do not use any rules in your
grammar, then trees will contain a sequence of nodes that will not have
any children.

Having spirit create a tree is similar to how a normal parse is done:

tree_match<> hit = expression.parse(tree_scanner);

if (hit)
    process_tree_root(hit.trees[0]); // do something with the tree

tree_node

Once you have created a tree by calling pt_parse
or ast_parse, you have a tree_parse_info
which contains the root node of the tree, and now you need to do something with
the tree. The data member trees of tree_parse_info
is a std::vector<tree_node>. tree_node provides the tree structure. The
class has one template parameter named T. tree_node contains an instance of
type T. It also contains a std::vector<tree_node<T> > which are
the node's children. The class looks like this:

This class is simply used to separate the tree framework from the data stored
in the tree. It is a generic node and any type can be stored inside it and accessed
via the data member value. The default type for T is node_val_data.
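To make the shape of the structure concrete, here is a small sketch. This tree_node is a simplified stand-in written for this illustration, with only the two data members described above; the real class in Spirit has more to it. The functions count_nodes() and collect_leaves() are hypothetical helpers, not part of Spirit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the class described above.
template <typename T>
struct tree_node
{
    T value;                              // the data stored in this node
    std::vector<tree_node<T> > children;  // sub-trees, in creation order

    tree_node() : value() {}
    explicit tree_node(T const& v) : value(v) {}
};

// Count the nodes in a tree with a depth-first walk.
template <typename T>
std::size_t count_nodes(tree_node<T> const& n)
{
    std::size_t total = 1; // this node
    for (std::size_t i = 0; i < n.children.size(); ++i)
        total += count_nodes(n.children[i]);
    return total;
}

// Collect the values of the leaf nodes, left to right.
template <typename T>
void collect_leaves(tree_node<T> const& n, std::vector<T>& out)
{
    if (n.children.empty())
    {
        out.push_back(n.value);
        return;
    }
    for (std::size_t i = 0; i < n.children.size(); ++i)
        collect_leaves(n.children[i], out);
}
```

Because the children are themselves tree_node objects, any code that processes a tree tends to be a short recursive function like the two above.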

node_val_data

The node_val_data class contains the actual information about each
node. This includes the text or token sequence that was parsed, an id
that indicates which rule created the node, a boolean flag that indicates whether
the node was marked as a root node, and an optional user-specified value. This
is the interface:

parser_id, checking and setting

If a node is generated by a rule, it will have an id set. Each rule
has an id that it sets on all nodes generated by that rule. The id is of type
parser_id. The default id of each rule
is set to the address of that rule (converted to an integer). This is not always
the most convenient, since the code that processes the tree may not have access
to the rules, and may not be able to compare addresses. So, you can override
the default id used by each rule by giving it a specific
ID. Then, when processing the tree you can call node_val_data::id()
to see which rule created that node.

structure/layout of a parse tree

parse tree layout

The tree is organized by the rules. Each rule creates a new level in the tree.
All parsers attached to a rule create a node when a successful match is made.
These nodes become children of a node created by the rule. So, the following
code:

When executed, this code would return a tree_match, m. m.trees[0]
would contain a tree like this:

The root node would not contain any text, and its id would be set to the
address of myrule. It would have four children. Each child's id would be set
to the address of myrule, would contain the text as shown in the diagram, and
would have no children.

ast layout

When calling ast_parse, the tree gets generated differently.
It mostly works the same as when generating a parse tree. The difference happens
when a rule generates only one sub-node. Instead of creating a new level, ast_parse
just leaves the existing node in place. So, this
code:

will generate a single node that contains 'a'. If tree_match_policy
had been used instead of ast_match_policy, the tree would have looked
like this:

ast_match_policy has the effect of eliminating intermediate rule levels which
are simply pass-through rules. This is not enough to generate abstract syntax
trees; root_node_d is also needed. root_node_d
will be explained later.
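The difference between the two layouts can be modeled in a few lines. This is only a sketch of the behavior described above, not Spirit's implementation: node, close_rule() and depth() are hypothetical names, and close_rule() stands for the moment a rule finishes matching and decides whether to add a level.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in node: the matched text plus child nodes.
struct node
{
    std::string text;
    std::vector<node> children;
};

// Model of what happens when a rule finishes matching. `matched` holds
// the nodes produced by the parsers inside the rule.
//   parse tree behavior: always add a new level for the rule
//   ast behavior:        skip the new level when there is only one node
node close_rule(std::vector<node> const& matched, bool ast_behavior)
{
    assert(!matched.empty());
    if (ast_behavior && matched.size() == 1)
        return matched[0];           // pass-through: keep the existing node

    node level;                      // new level: no text of its own
    level.children = matched;
    return level;
}

// Depth of a tree: 1 for a lone node.
std::size_t depth(node const& n)
{
    std::size_t deepest = 0;
    for (std::size_t i = 0; i < n.children.size(); ++i)
    {
        std::size_t d = depth(n.children[i]);
        if (d > deepest)
            deepest = d;
    }
    return deepest + 1;
}
```

With a single matched node 'a', the parse tree behavior yields a tree of depth 2 (an empty rule node above 'a'), while the ast behavior yields just the node 'a'.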

switching: gen_pt_node_d[] & gen_ast_node_d[]

If you want to mix and match the parse tree and ast behaviors in your application,
you can use the gen_pt_node_d[] and gen_ast_node_d[] directives.
When parsing passes through the gen_pt_node_d directive, parse tree
creation behavior is turned on. When the gen_ast_node_d
directive is used, the enclosed parser will generate a tree using the
ast behavior. Note that you must pay attention to how your rules are declared
if you use a rule inside of these directives. The match policy of
the scanner will have to correspond to the desired behavior. If you
avoid rules and use primitive parsers or grammars, then you will not have
problems.

Directives

There are a few more directives that can be used to control the generation
of trees. These directives only affect tree generation. Otherwise, they have
no effect.

no_node_d

This directive is similar to gen_pt_node_d and gen_ast_node_d,
in that it modifies the scanner's match policy used by the enclosed parser. As its name
implies, it does no tree generation; it turns it off completely. This is useful
if there are parts of your grammar which are not needed in the tree, for instance
keywords and operators (*, -, &&, etc.). By eliminating
these from the tree, both memory usage and parsing time can be lowered. This
directive has the same requirements with respect to rules as gen_pt_node_d
and gen_ast_node_d do. See the example file xml_grammar.hpp (in the libs/spirit/example/application/xml
directory) for example
usage of no_node_d[].

discard_node_d

This directive has a similar purpose to no_node_d, but works differently.
It does not switch the scanner's match policy, so the enclosed parser still generates
nodes. The generated nodes are discarded and do not appear in the tree. Using
discard_node_d is slower than no_node_d, but it does not suffer
from the drawback of having to specify a different rule type for any rule inside
it.

leaf_node_d/token_node_d

Both leaf_node_d and token_node_d work the same. They group
together all the nodes generated by the enclosed parser. Note that a rule should
not be used inside these directives.

This rule:

rule_t integer = !ch_p('-') >> *(range_p('0', '9'));

would generate a root node with the id of integer and a child node for each
character parsed.

This:

rule_t integer = token_node_d[ !ch_p('-') >> *(range_p('0', '9')) ];

would generate a root node with only one child node that contained the entire
integer.
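The effect of the grouping can be pictured with plain strings. The sketch below is a model only: leaf, group_nodes() and to_long() are hypothetical names made up for this illustration, and Spirit does the grouping for you while parsing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for a leaf node: just the matched text.
struct leaf
{
    std::string text;
};

// Model of what leaf_node_d/token_node_d achieve: the nodes generated by
// the enclosed parser are merged into a single node holding all the text.
leaf group_nodes(std::vector<leaf> const& nodes)
{
    leaf grouped;
    for (std::size_t i = 0; i < nodes.size(); ++i)
        grouped.text += nodes[i].text;
    return grouped;
}

// With all the characters in one node, conversion is a single call.
long to_long(leaf const& n)
{
    assert(!n.text.empty());
    return std::strtol(n.text.c_str(), 0, 10);
}
```

Without the grouping, the code that processes the tree would have to do the equivalent of group_nodes() itself before it could convert anything.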

infix_node_d

This is useful for removing separators from lists. It discards all the nodes
in even positions. Thus this rule:

rule_t intlist = infix_node_d[ integer >> *(',' >> integer) ];

would discard all the comma nodes and keep all the integer nodes.
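As a rough model of "discards all the nodes in even positions", consider the nodes as a flat list of strings. The function drop_even_positions() is a hypothetical name for this sketch; positions are counted from 1, so the 2nd, 4th, ... nodes are the separators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Model of infix_node_d on a flat list of node texts: keep the 1st, 3rd,
// 5th ... nodes and drop the ones in even positions (2nd, 4th ...).
std::vector<std::string> drop_even_positions(std::vector<std::string> const& nodes)
{
    std::vector<std::string> kept;
    for (std::size_t i = 0; i < nodes.size(); i += 2)
        kept.push_back(nodes[i]);

    // Half of the nodes survive, rounded up.
    assert(kept.size() == (nodes.size() + 1) / 2);
    return kept;
}
```

Applied to the nodes 1 , 2 , 3 this leaves 1 2 3, which is what you want from a comma separated list.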

discard_first_node_d

This discards the first node generated.

discard_last_node_d

This discards the last node generated.

inner_node_d

This discards the first and last node generated.
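The three discarding directives are closely related, which a small model makes visible. The sketch works on a flat list of node texts; nodes_t, discard_first(), discard_last() and inner() are hypothetical names for this illustration and only mirror the names of the directives.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef std::vector<std::string> nodes_t; // node texts, in creation order

// Model of discard_first_node_d: drop the first node, if there is one.
nodes_t discard_first(nodes_t nodes)
{
    if (!nodes.empty())
        nodes.erase(nodes.begin());
    return nodes;
}

// Model of discard_last_node_d: drop the last node, if there is one.
nodes_t discard_last(nodes_t nodes)
{
    if (!nodes.empty())
        nodes.pop_back();
    return nodes;
}

// Model of inner_node_d: the combination of the two above.
nodes_t inner(nodes_t const& nodes)
{
    return discard_last(discard_first(nodes));
}
```

For the nodes ( 1 + 2 ) the inner() model keeps 1 + 2, which is the use made of inner_node_d for parentheses in the calculator.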

root_node_d and ast generation

The root_node_d directive is used to help out ast generation. It
has no effect when generating a parse tree. When a parser is enclosed in root_node_d,
the node it generates is marked as a root. This affects the way it is treated
when it's added to the list of nodes being generated. Here's an example:

rule_t add = integer >> *(root_node_d[ ch_p('+') ] >> integer);

When parsing 5+6 the following tree will be generated:

When parsing 1+2+3 the following will be generated:

When a new node is created the following rules are used to determine how the
tree will be generated:

Let a be the previously generated node. Let b be the new node.

If b is a root node then

b's children become a + b's previous children. a is the new first child of b.

else if a is a root node and b is not, then

b becomes the last child of a.

else

a and b become siblings.

After parsing leaves the current rule, the root node flag on the top node
is turned off. This means that the root_node_d directive only affects the current
rule.
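The rules above are easier to follow when you watch them run. The following sketch is a simplified model and not Spirit's implementation: it feeds the nodes to the rules one at a time, and it assumes that every operand is a single digit and that '+' is the only node marked as a root. The names node, add_node(), leave_rule(), build_add_tree() and show() are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in node with a root flag.
struct node
{
    std::string text;
    bool is_root;
    std::vector<node> children;

    node(std::string const& t, bool root) : text(t), is_root(root) {}
};

// `a` holds the nodes generated so far by the current rule; `b` is the
// new node. The three branches follow the three rules in the text.
void add_node(std::vector<node>& a, node b)
{
    if (b.is_root)
    {
        // b's children become a + b's previous children
        b.children.insert(b.children.begin(), a.begin(), a.end());
        a.assign(1, b);
    }
    else if (!a.empty() && a.front().is_root)
    {
        a.front().children.push_back(b); // b becomes the last child of a
    }
    else
    {
        a.push_back(b);                  // a and b become siblings
    }
}

// When parsing leaves the rule, the root flag on the top node is cleared.
void leave_rule(std::vector<node>& a)
{
    if (!a.empty())
        a.front().is_root = false;
}

// Build the tree for input such as "1+2+3" (single digit operands only).
node build_add_tree(std::string const& input)
{
    std::vector<node> a;
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        char c = input[i];
        add_node(a, node(std::string(1, c), c == '+'));
    }
    leave_rule(a);
    assert(a.size() == 1); // holds for well-formed input such as "1+2"
    return a[0];
}

// Print a tree as a nested list: parent first, then its children.
std::string show(node const& n)
{
    if (n.children.empty())
        return n.text;
    std::string s = "(" + n.text;
    for (std::size_t i = 0; i < n.children.size(); ++i)
        s += " " + show(n.children[i]);
    return s + ")";
}
```

In this model, show(build_add_tree("5+6")) gives "(+ 5 6)" and show(build_add_tree("1+2+3")) gives "(+ (+ 1 2) 3)": each '+' becomes the parent of everything generated before it, and the operand that follows is appended as its last child.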

parse_tree_iterator

The parse_tree_iterator class allows you to parse a tree using spirit.
The class iterates over the tokens in the leaf nodes in the same order they
were created. The parse_tree_iterator is templated on ParseTreeMatchT.
It is constructed with a container of trees, and a position to start. Here is
an example usage:

rule_t myrule = ch_p('a');
char const* input = "a";

// generate parse tree
tree_parse_info<> i = pt_parse(input, myrule);

typedef parse_tree_iterator<tree_match<> > parse_tree_iterator_t;

// create a first and last iterator to work off the tree
parse_tree_iterator_t first(i.trees, i.trees.begin());
parse_tree_iterator_t last;

advanced tree generation

node value

The node_val_data can contain a value. By default it contains a void_t,
which is an empty class. You can specify the type, using a template parameter,
which will then be stored in every node. The type must be default constructible,
and assignable. You can get and set the value using

ValueT const& node_val_data::value() const;

and

void node_val_data::value(ValueT const& value);

To specify the value type, you must use a different node_val_data
factory than the default. The following example shows how to modify the factory to store and retrieve a double inside each node_val_data.

access_node_d

Now, you may be wondering, "What good does it do to have a value I can
store in each node, but not to have any way of setting it?" Well, that's
what access_node_d is for. access_node_d is a directive. It
allows you to attach an action to it, like this:

access_node_d[...some parsers...][my_action()]

The attached action will be passed 3 parameters: A reference to the root node
of the tree generated by the parser, and the current first and last iterators.
The action can set the value stored in the node.
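An action of that shape might look like the sketch below. It is written against a hypothetical stand-in node, value_node, whose value is a plain double member; with Spirit's own nodes you would go through the value accessors of node_val_data instead. The functor name store_number is also made up for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for a node that carries a double as its value.
struct value_node
{
    std::string text;
    double value;
    std::vector<value_node> children;

    value_node() : value(0.0) {}
};

// An action with the three-parameter shape described above: the root
// node of the generated tree, then the first and last iterators.
struct store_number
{
    template <typename NodeT, typename IteratorT>
    void operator()(NodeT& n, IteratorT first, IteratorT last) const
    {
        assert(first != last);          // there must be matched input
        std::string matched(first, last);
        n.value = std::strtod(matched.c_str(), 0);
    }
};
```

Calling store_number()(n, input.begin(), input.end()) with the input "3.5" leaves 3.5 in n.value, so later passes over the tree can use the number without converting the text again.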

Tree node factories

By setting the factory, you can control what type of nodes are created and
how they are created. There are 3 predefined factories: node_val_data_factory,
node_all_val_data_factory, and node_iter_data_factory. You
can also create your own factory to support your own node types.

Using factories with grammars is quite easy: you just need to specify the factory type as an explicit template parameter to the free ast_parse function:

node_val_data_factory

This is the default factory. It creates node_val_data nodes. Leaf
nodes contain a copy of the matched text, and intermediate nodes don't. node_val_data_factory
has one template parameter: ValueT. ValueT specifies the type
of value that will be stored in the node_val_data.

node_all_val_data_factory

This factory also creates node_val_data. The difference between it
and node_val_data_factory is that every node contains all the
text that spans it. This means that the root node will contain a copy of the
entire parsed input sequence. node_all_val_data_factory has one template
parameter: ValueT. ValueT specifies the type of value that
will be stored in the node_val_data.

node_iter_data_factory

This factory creates node_iter_data nodes. This kind of node stores iterators
into the input sequence instead of making a copy. It can use a lot less memory.
However, the input sequence must stay valid for the life of the tree, and it's
not a good idea to use the multi_pass iterator with this type of node.
All levels of the tree will contain a begin and end iterator. node_iter_data_factory
has one template parameter: ValueT. ValueT specifies the type
of value that will be stored in each node.

custom

You can create your own factory. It should look like this:

class my_factory
{
public:

    // This inner class is so that the factory can simulate
    // a template template parameter