Variable references

Let's say I have a block of text and an environment consisting of a
mapping between (text) variable names and (text) values, and that I
want to go through the block locating variable references and
replacing them with the value of the variable from the environment.
Further, let us say the variable references are similar to make's:
$(variable). In other words, the variable name is
surrounded by parentheses, with a leading dollar sign.

Now, it would be simple to scan through the text looking for strings
like "$(" and, when finding one, scan forward for the next ")".
However, this fails on nested evaluations like:

some text $(a$(variable)) some text

with:

"variable" => "Var"
"aVar" => "some text"

(I am assuming that the values do not have references in them, which
was correctly handled in my original problem. However, allowing them
does not significantly distort my final conclusion.)

To handle that while scanning forward, I would need to have some
bizarro, recursive scanning and evaluation scheme that gets ugly and
complex quickly. As I was looking at this problem (there is no real
"let's say" here; this is exactly what I was trying to do), I realized
it became much simpler if I did the original scan from right to left,
from the end of the text block towards the beginning. Then, finding
the first "$(" leaves the state looking like:

some text $(a$(variable)) some text
             ^

Subsequently finding the first ")" results in:

some text $(a$(variable)) some text
             ^---------^

which is easily replaced:

some text $(aVar) some text

Repeating that process clearly and easily produces the correct
evaluated text (assuming I do not have recursive variable
definitions):

some text $(aVar) some text
          ^
          ^-----^
some text some text some text

Further, I realized that, if I kept track of the location of the "$(",
I would not need to re-scan the text block repeatedly. Given that I
have found the right-most "$(", and that the expansion does not
contain "$(", I can just continue moving left from my original
location:
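A minimal sketch of that scan in C++ (my own code, not the original;
the environment is a plain std::map, and values are assumed not to
contain "$("):

```cpp
#include <cassert>
#include <map>
#include <string>

// Expand $(name) references right-to-left.  Because the scan moves
// from the end of the text toward the beginning, an inner reference
// such as $(variable) in $(a$(variable)) is expanded before the
// outer one ever needs to be parsed.
std::string expand(std::string text,
                   const std::map<std::string, std::string>& env)
{
    std::string::size_type lpos = std::string::npos;
    while ((lpos = text.rfind("$(", lpos)) != std::string::npos) {
        // The first ")" after this "$(" closes it: any nested
        // reference to the right has already been replaced.
        std::string::size_type rpos = text.find(')', lpos);
        if (rpos == std::string::npos)
            break;                     // unbalanced; leave the rest alone
        const std::string name = text.substr(lpos + 2, rpos - lpos - 2);
        std::map<std::string, std::string>::const_iterator it =
            env.find(name);
        const std::string value = (it == env.end()) ? "" : it->second;
        text.replace(lpos, rpos - lpos + 1, value);
        // Continue moving left from the remembered position; no
        // re-scan of the whole text is needed.
        if (lpos == 0)
            break;
        --lpos;
    }
    return text;
}
```

Because the scan resumes from the remembered position, the text to the
right of it is never re-scanned.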

To handle variable values which could themselves include references
(such as "aVar" mapping to "a$(variable)"), the algorithm would need
to change so that the rfind("$(",...) started at the end of the
replacement rather than at its beginning (i.e. at lpos +
result.length rather than at lpos). However, that algorithm would not
catch non-terminating recursive variable references; those are handled
elsewhere. (Specifically, get() evaluates variable references in a
result it finds before returning the result; circular references are
its responsibility.)

Recovering structure in a configuration file

I wanted to parse the configuration file with a framework based on
flex and Bison. (Actually, with a C++ translation of my famous and
award-winning FlexBisonModule framework.)

The XML taggy things make the configuration format context
sensitive.

To see the context sensitivity, consider that <a> is closed by </a>;
other tags can be nested in between, so the structure is determined by
the "a" tag names, which would be hard to do using Bison given that
there is no a priori list of tags. (Hard to do, as far as I could
tell, anyway.)

But, I suddenly realized, I can punt! Instead of trying to recover
the structure while in the Bison parser, I could simply treat "<...>"
and "</...>" the same as any other directives, parse the file as a
simple list of directive lines, and then recover the tree structure in
a pass over the parse result.

I will pass over the parsing process here; suffice it to say that it
results in a parse tree consisting of a top element which contains a
sequence of directive elements:
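The flat result has a shape something like this (my illustration, not
the framework's actual output; Directive and Sect_Close are stand-in
names):

```text
Lines
  Directive    set a=b
  Sect_Open    <a>
  Directive    set c=d
  Sect_Close   </a>
  Directive    set e=f
```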

Then, when trying to determine how to recreate the nested, tree
structure, I realized it was a variation on the same algorithm as
variable expansion above: Go backward through the list to find the
first open-tag-directive, then go forward to the first
closing-tag-directive; it should match the open-tag. Then, take the
directives in between and insert them as children of the open-tag and
remove them (and the close-tag-directive) from the list. Repeat as
needed.

This code is part of a method of the Lines class, which is created and
filled in by the parser; it contains all of the directive lines from
the file. The sym_children attribute is part of the Symbol class in
the parsing framework; all Symbols can have sym_children; Lines is a
Symbol.

The code uses several STL algorithms with functors; the first is
_is_unstructured_open:
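From that description, the functor might look something like this (a
self-contained sketch with minimal stand-in classes; the framework's
real Symbol and Sect_Open are richer, and it uses TR1 shared_ptr where
this sketch uses std::shared_ptr):

```cpp
#include <cassert>
#include <memory>   // std::shared_ptr stands in for TR1's here
#include <vector>

// Minimal stand-ins for the framework's classes (my sketch, not the
// original): every Symbol can have children; Sect_Open is an open tag.
struct Symbol {
    typedef std::shared_ptr<Symbol> ptr;
    std::vector<ptr> sym_children;
    virtual ~Symbol() {}
};

struct Sect_Open : Symbol {
    typedef std::shared_ptr<Sect_Open> ptr;
    // structured(): true once this tag's children have been recovered.
    bool structured() const { return !sym_children.empty(); }
};

// Functor: matches an open-tag that has not yet been processed.
struct _is_unstructured_open {
    bool operator()(const Symbol::ptr& sym) const {
        Sect_Open::ptr open = std::dynamic_pointer_cast<Sect_Open>(sym);
        return open && !open->structured();
    }
};
```

The functor is meant to be handed to find_if over the (reversed) lines
list.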

The parsing framework makes heavy use of TR1 shared_ptr
reference-counted smart pointers; dynamic_pointer_cast is used to
down-cast those shared_ptrs. I typically define, inside each class, a
type synonym "ptr" for a shared_ptr to that class.

The structured() method returns true if the Sect_Open object already
has children; in other words, if its structure has already been
recovered by an earlier iteration of this code. This test is needed to
skip over already-examined open-tags if the search for an open-tag
begins at the end of the lines list every time; while the final
algorithm presented here does not do that, the test was useful during
development.

The _is_unstructured_open functor thus returns true if the Symbol
being examined is a Sect_Open and it has not already been processed.

The iterator ibase, in the original code, is a forward iterator
constructed from the reverse iterator i (i.e. i.base()) and points to
the element immediately after the one i refers to.

The iterator j should then point to the close-tag matching open (and
thus i) by name. To find j, the find_if algorithm is used again,
along with the _is_close functor:
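The pieces fit together something like this (a self-contained sketch:
Sect_Close and find_matching_close are my stand-in names, and
std::shared_ptr stands in for TR1's; the framework's real classes
differ):

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <memory>
#include <string>

// Minimal stand-ins for the framework's classes (my sketch).
struct Symbol {
    typedef std::shared_ptr<Symbol> ptr;
    virtual ~Symbol() {}
};
struct Sect_Open : Symbol {
    std::string name;
    explicit Sect_Open(const std::string& n) : name(n) {}
};
struct Sect_Close : Symbol {   // stand-in name for the close-tag class
    std::string name;
    explicit Sect_Close(const std::string& n) : name(n) {}
};

// Matches the close-tag with the given tag name.
struct _is_close {
    std::string name;
    explicit _is_close(const std::string& n) : name(n) {}
    bool operator()(const Symbol::ptr& sym) const {
        std::shared_ptr<Sect_Close> close =
            std::dynamic_pointer_cast<Sect_Close>(sym);
        return close && close->name == name;
    }
};

// Find the close-tag matching the rightmost open-tag: scan backward
// for the open-tag (i), convert to a forward iterator (ibase), then
// scan forward with find_if for the matching close-tag (j).
std::list<Symbol::ptr>::iterator
find_matching_close(std::list<Symbol::ptr>& lines)
{
    std::list<Symbol::ptr>::reverse_iterator i = lines.rbegin();
    std::shared_ptr<Sect_Open> open;
    for (; i != lines.rend(); ++i)
        if ((open = std::dynamic_pointer_cast<Sect_Open>(*i)))
            break;
    if (!open)
        return lines.end();
    std::list<Symbol::ptr>::iterator ibase = i.base();  // element after i
    return std::find_if(ibase, lines.end(), _is_close(open->name));
}
```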

The append() method accepts a pointer to a symbol and adds the pointer
to the Lines object's children.

At this point, there are two references to the child-directives: that
in the new Lines object and that in the original overall Lines. To
finish recovering the structure, I remove the contained directives
from the original:

sym_children.erase( ibase, ++j );

The ibase iterator is the first element after the open-tag; j is the
close-tag. By incrementing j, the argument to erase() is the element
after the close-tag; by the magic of reference-counted shared_ptr's,
this cleanly deletes the close tag along with removing the extraneous
references to the contained directives that are cluttering up the
structure.

Parsing Python-like text

Another task from the same project called for parsing a Python-like
domain specific language; again I wanted to use flex and Bison, and
again discovered the same algorithm.

Python (and the DSL) uses indentation to structure blocks in the
program text. The Python interpreter (and my initial parsing idea)
uses synthetic BEGIN and END tokens, inserted when the indentation
increases or decreases. Such synthetic tokens are hard to generate
with flex, however. So, I once again decided to punt.

I am parsing the whole DSL text as a sequence of lines/statements.
The grammar has a rule similar to:
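I will not reproduce the project's actual grammar, but the rule in
question presumably has a shape like this (a hedged sketch in Bison
syntax; all the names are my guesses):

```yacc
/* Sketch only: the file is a flat sequence of statements. */
elements
    : /* empty */
    | elements statement
    ;

statement
    : if_stmt
    | elif_stmt
    | else_stmt
    | assignment
    ;
```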

How to recover the blocks (or "suites", following Python)? My parsing
toolkit records a location for every token read: the file name, line
number, and character position within the line, for both the beginning
and end of every token. It also records the same information for
symbols; the locations are updated when a token or symbol is added as
a sub-tree of another symbol. For example, the code for the if_stmt
rule is:

$$ = push_symbol($1, $2);

In this code, $1 is the IF token and $2 is the expression; push_symbol
adds the expression to the list of child sub-trees of the IF token
(tokens are symbols, too). This process updates the location for the
IF symbol to include the expression; the result will be something like
"file 'text-1.txt', line 50, char 4 - char 20".

The final result of Bison is a flat parse tree: the top-level Elements
symbol contains a flat sequence of elements. Recovering the
indentation-based block structure is the responsibility of the
algorithm:

[Note to self: add typedefs so that "list<...>::reverse_iterator"
isn't needed.]
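The pass can be sketched as follows (self-contained stand-ins, not the
original code: Element, its indent field, and recover_suites are my
own names, the real version works on the framework's Symbols and
derives indentation from their recorded locations, and std::shared_ptr
stands in for TR1's):

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <string>
#include <vector>

// Minimal stand-in (my sketch): each element knows its indentation
// (taken from the token locations) and whether it accepts a suite.
struct Element {
    typedef std::shared_ptr<Element> ptr;
    std::string text;
    int indent;
    bool accepts_suite;
    bool processed;
    std::vector<ptr> suite;   // the recovered sub-block
    Element(const std::string& t, int i, bool a)
        : text(t), indent(i), accepts_suite(a), processed(false) {}
};

// Recover indentation-based structure from the flat list, scanning
// right-to-left for suite-accepting elements.
void recover_suites(std::list<Element::ptr>& elements)
{
    for (;;) {
        // Rightmost suite-accepting element not yet processed.
        std::list<Element::ptr>::reverse_iterator i = elements.rbegin();
        for (; i != elements.rend(); ++i)
            if ((*i)->accepts_suite && !(*i)->processed)
                break;
        if (i == elements.rend())
            return;
        (*i)->processed = true;
        // Forward from the element after it: everything indented more
        // deeply belongs to its suite.
        std::list<Element::ptr>::iterator ibase = i.base();
        std::list<Element::ptr>::iterator j = ibase;
        while (j != elements.end() && (*j)->indent > (*i)->indent) {
            (*i)->suite.push_back(*j);
            ++j;
        }
        elements.erase(ibase, j);
    }
}
```

Scanning right-to-left guarantees the deepest unprocessed block is
handled first, so each element's suite is complete before the element
itself is moved into its parent's suite.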

Once again, there is a small stack of helper functors. In this case,
_accepts_suite tests whether a symbol is one which accepts a sub-block
(If, Elif, Else, etc.) or not (an Assignment, for
example):
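Something like this, perhaps (a sketch; If, Elif, Else, and Assignment
are stand-ins for the framework's real token classes, and
std::shared_ptr for TR1's):

```cpp
#include <cassert>
#include <memory>

// Minimal stand-ins for the framework's symbol classes (my sketch).
struct Symbol {
    typedef std::shared_ptr<Symbol> ptr;
    virtual ~Symbol() {}
};
struct If : Symbol {};
struct Elif : Symbol {};
struct Else : Symbol {};
struct Assignment : Symbol {};

// True for symbols that accept a sub-block ("suite"), false otherwise.
struct _accepts_suite {
    bool operator()(const Symbol::ptr& sym) const {
        return std::dynamic_pointer_cast<If>(sym)
            || std::dynamic_pointer_cast<Elif>(sym)
            || std::dynamic_pointer_cast<Else>(sym);
    }
};
```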

The result of using this algorithm is that sub-suites are first moved
into a suite under their element, then that element is moved into a
suite under its element, and so on. The final result is the
appropriate parse tree.

One unfortunate result is that nonsensical text is accepted:

else:
    n = 1
elif (foo < 1):
    m = 4

This is handled by the higher-level interpreter, which throws a syntax
error when it sees garbage like that. As an alternative, another pass
could be made over the parse tree, allowing only legitimate
structures.

Generalization

All three of these functions could, I believe, be unified in a
higher-level template function. I have not yet done that, however.