It is possible to create a simple parser using Lex alone,
by making extensive use of user-defined states (ie start-conditions).
However, such a parser quickly becomes unmaintainable, as the
number of user-defined states tends to explode.

Once our input file syntax contains complex structures,
such as "balanced" brackets, or contains elements
which are
context-sensitive, we should
be considering yacc.

"Context-sensitive" in this case means that a word or symbol
can have different interpretations, depending on where it
appears in the input language. For example in C, the '*' character is used
for both multiplication, and to specify indirection (ie to dereference
a pointer to a piece of memory). Its meaning is "context-sensitive".
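The two readings of '*' can be seen in a pair of trivial C functions (mult and deref are hypothetical names, used only for illustration):

```c
/* '*' between two expressions: multiplication */
int mult(int a, int b) { return a * b; }

/* '*' before a pointer: indirection (dereference) */
int deref(int *p) { return *p; }
```

The same character, two meanings; only the context tells them apart.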

A LABEL
is either an identifier (loosely speaking) or
an arbitrary string in quotes, or any alternating
sequence of these two things.

An EXEC token is identified by the
fact that:

it is not a keyword.

it appears after we have scanned a LABEL token on the
same line.
This is achieved by setting the start-condition ACT when
we scan the LABEL token.

We use lex's "first match" rule to ensure that
keywords get priority over the corresponding LABEL and EXEC interpretations,
and that the EXEC interpretation gets priority over the
LABEL interpretation in the state ACT.

It is thus essential that, where a keyword may appear on a line,
the length of the other rules (for LABEL or EXEC) be no longer
than the keyword rule. Otherwise, lex's "longest match" rule would
override the "first match" rule.
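As an illustrative sketch (these are not the tutorial's actual rules), consider a keyword listed ahead of a catch-all pattern:

```lex
MENU            { return MENU; }    /* keyword rule, listed first */
[A-Za-z0-9_-]+  { return LABEL; }   /* on the text "MENU" this matches the
                                       same length, so "first match" gives
                                       the keyword priority */
```

If the catch-all could match *more* characters than the keyword at the same position, "longest match" would win instead, which is exactly the situation the paragraph above warns about.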

The EXEC token is constructed using the exclusive
start-condition CMD together with yymore().
These rules allow the
command to be extended across multiple lines, if the line(s)
end in a backslash. Using yymore() in this fashion
ensures that our arbitrarily long command-string does not
override the LABEL or keyword tokens by virtue of lex's
"longest match" rule.

The last newline after the EXEC token is not
appended to the command string, but returned separately.
Note that the rules

<CMD>\\\n { ... yymore(); }
<CMD>.$   { ... return(EXEC); }

are treated as being the same length, so it is important that
they appear in the correct order.

Please ignore the variables
yylval and yylloc for now.
Their meaning will only become clear after we've started
looking at yacc in detail.

Likewise, the lex rules associated with the start-conditions
ENV, ENV1, ENV2 are not part of the basic scanner,
and will be covered later.

At this point, it's probably a good idea to test the lexer
"stand-alone". However, there's still one thing missing: tokens.

We have included a lot of return(XXX) statements, but have not
defined the tokens in the brackets. A token is simply a unique
integer which yacc associates with an item that the lexer
has found. Within yacc, all tokens must be declared in the
"Yacc Declarations" section of our yacc specification, like this:

%token TITLE
%token MENU
...etc

Yacc will choose suitable integer values (>=256) for
lex to use. (see also: "Token Types")
Bison will also let you choose a value manually, like this:

%token END 999

Normally, we let the parser choose values, and that's one less thing
to worry about.

Part of the lex-yacc integration is that yacc will
generate a suitable set of token-definitions for lex to use.
Yacc does so by generating a file y.tab.h
(bison generates basename.tab.h ) which contains the
token-definitions and looks like this:

#define TITLE 258
#define MENU 259
...etc

This also keeps the lex-yacc communications automatically
in step with each other.

We don't need the whole yacc specification to do this.
We can get by with something like the file
olmenu_tokens.y

Yacc rules define what is a legal sequence of tokens
in our specification language. In our case, let's look at
the rule for a simple, executable menu-command:

menu_item : LABEL EXEC
;

This rule defines a non-terminal symbol, menu_item
in terms of
the two tokens LABEL and EXEC. Tokens
are also known as "terminal symbols", because the parser does
not need to expand them any further. Conversely, menu_item
is a "non-terminal symbol" because it can be expanded into
LABEL and EXEC.

You may notice that I'm using UPPER CASE for terminal symbols (tokens),
and lower-case for non-terminal symbols. This is not a strict
requirement of yacc, but just a convention that has been
established. We will follow this convention throughout our
discussion.

We've
just hit our first complication: Any given menu-item
may also have the keyword DEFAULT appear between the
label and the executable command. Yacc allows us to have multiple
alternate definitions of menu_item, like this:

menu_item : LABEL EXEC
| LABEL DEFAULT EXEC
;

Note that the colon (:) semi-colon (;)
and or-symbol (|) are part of the
yacc syntax - they are not part of our menu-file definition.
All yacc rules follow the basic syntax shown above and must
end in a semi-colon.
We've put the semi-colon on the next line for clarity, so that
it does not get confused with our syntax-definitions. This is not
a strict requirement, either, but another convention of style that we
will adhere to.

Note also that the word DEFAULT appears literally,
not because it is a keyword in our input-language, but because we have
defined a %token called DEFAULT,
and the lexer returns this token when
it finds a certain piece of text.

The comment /* empty */ is ignored by yacc, and
can be omitted, but again, it is conventional to include it for
any empty rules.

Strange as it may seem, the absence of the keyword
DEFAULT is also a valid rule!
Yacc acknowledges the empty rule for "default"
when it sees that its current look-ahead token is EXEC,
and not DEFAULT.
See the section
"Look-Ahead"
in the Bison documentation for more information about "look-ahead".

To understand why this 2nd approach might be
considered better than our earlier one, we need to explore
Yacc Actions.

So far, we have only considered the tokens
LABEL and EXEC
as single-valued integers which are passed from the lexer to the
parser. What we really need, is access to the text-strings associated
with these tokens (ie their
semantic value).

We could do this using a global variable (like token_txt
in our spam-checking
program), except that yacc executes the action after it
has read all the tokens up to that point.
Hence the string value for EXEC would overwrite
the one for LABEL before we had a chance to use it.
We could use separate global variables for the
LABEL and EXEC strings, but this
won't always work, because sometimes yacc has to read a token
in advance before it can decide which rule to use.

Consider the MENU keyword, in our case.
Yacc has to check whether
it is followed by another string or a newline, before it can decide whether it
is being used to introduce a sub-menu within the same file, or
an external menu-file.

In any case, yacc provides a formal method for dealing with
the semantic value of tokens. It begins with the
lexer. Every time the lexer returns a value, it should also set
the external variable yylval to the value of the
token. Yacc will then retain the association between the token
and the corresponding value of yylval.

In order to accommodate a variety of different token-types,
yylval is declared as a union of
different types.

Token types are declared in yacc using the yacc declaration
%union, like this:

%union {
char *str;
int num;
}

This defines yylval as being a union of the
types (char*) and (int). This is a classical
C-program union, so any number of types may be defined, and the
union may even contain struct types, etc.
For now, we'll just have these two types.
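As a plain-C illustration of the same idea (the names semval, store_num and store_str are hypothetical, not part of the generated parser):

```c
/* A stand-in for the yylval union that yacc generates from %union. */
union semval {
    char *str;
    int   num;
};

/* Only one member is "live" at a time, just as only one token's value
 * is in transit between the lexer and the parser at any moment. */
int store_num(union semval *v, int n) { v->num = n; return v->num; }
const char *store_str(union semval *v, char *s) { v->str = s; return v->str; }
```

Writing to one member and then reading the other is undefined, which is why yacc needs to know which type goes with which token.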

We also need to tell yacc which type is associated with
which token. This is done by modifying our %token declarations to
include a type, like this:

%token <str> LABEL
%token <str> EXEC
%token <num> INT

We do not need to modify any other %token declarations, because
they are all for keywords, which do not require any associated value.

Now we need to modify the lexer. This is done by including the line:

yylval.str = strdup(yytext);

just before the lexer returns the LABEL and EXEC
tokens. We'll also include the line:

Now that we have the token value, we want to make use of it.
Yacc lets us refer to the value of a given token
using a syntax similar to that of awk and perl.
$1 is the value of the 1st token, $2 is the 2nd, and so on.
Here is a typical example of an action:
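Based on the description that follows, such an action would look something like this (a sketch; new_item() and the itemptr members are the names used in the discussion below):

```yacc
menu_item : LABEL EXEC
        {
            itemptr = new_item();
            itemptr->label = $1;
            itemptr->command = $2;
        }
    ;
```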

new_item() is a function which allocates
some memory for the
structure itemptr.

itemptr->label and itemptr->command
are of type (char *) and are used to store the
char pointers referred to by $1 and $2.

Let's consider what happens when we want to accommodate the
DEFAULT keyword. We'll assume that the structure
itemptr contains an itemptr->default variable for
storing a simple indicator of whether the DEFAULT
keyword was used or not.
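The version under discussion would look something like this (a sketch; `default` as a member name follows the document's own naming, even though it is a C reserved word):

```yacc
menu_item : LABEL default EXEC
        {
            itemptr = new_item();   /* allocated too late, as we shall see */
            itemptr->label = $1;
            itemptr->command = $3;
        }
    ;

default : /* empty */  { itemptr->default = 0; /*segv*/ }
    | DEFAULT          { itemptr->default = 1; /*segv*/ }
    ;
```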

This is nicer, because we've removed the repetition in our actions.
However, I've added the comment /*segv*/, because
that's exactly what we would get. This is because of
the way yacc works. Here's what would happen:

Yacc gets the token LABEL, and, since it doesn't
have the complete rule yet, it just "saves it for later"
(by pushing it onto a stack).

Yacc sees the non-terminal symbol default, and starts
processing the rule for it.

The default rule is simply: DEFAULT, or nothing.
There are only two possibilities:

a)

the next token is, say, EXEC.
It's not DEFAULT, but that's OK (syntactically), because
empty is a valid rule in this case. Yacc saves the EXEC
token for later. Since the current rule (empty) is both valid and
complete, yacc executes the action for it.

b)

Let's assume that the next token is DEFAULT.
This is all that we need to complete the rule default,
so yacc executes
the action for it.

In either case, we execute an action in the default rule.

Now we get to our EXEC token. We have all the
elements of our menu_item rule (including the
non-terminal default rule), so we can now execute the
action for that.
But WAIT! We're just about to allocate itemptr using the
function new_item(),
but we've already used itemptr in an action for the previous rule, default.
It's too late, we've already crashed.

Yacc provides a simple yet elegant solution to this dilemma, by
extending the concept of the "value" of a token to non-terminal
symbols, like default.

First, we have to declare the type of default in our
Yacc Declarations section, like this:

%type <num> default

Note that this is the same approach as we used for %token
definitions. We can even use types which are not used by the
lexer, but we must add them to our %union declaration.

We assign the value of the left-hand-side of the rule by assigning
a value to $$.
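Putting this together for default (a sketch along the lines of the earlier examples):

```yacc
default : /* empty */  { $$ = 0; }
    | DEFAULT          { $$ = 1; }
    ;

menu_item : LABEL default EXEC
        {
            itemptr = new_item();
            itemptr->label = $1;
            itemptr->command = $3;
            itemptr->default = $2;  /* the value assigned to $$ above */
        }
    ;
```

Because the default actions now touch no shared state, the order in which yacc executes them no longer matters.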

Just as you might do with C-program variables, it is
possible to typecast a value which is being accessed via
the $ mechanism, by writing it as
$<type>n
instead of $n, where type
is a member of our %union
declaration. For example:

{ printf("addr: %x\n", $<num>1 ); }

Normally, you would not need to type-cast token-values
in this fashion.

Until now, we have assumed that if we don't specify
an action, yacc does nothing. This is not true. In the
absence of an explicit action, yacc applies a default
action of:

{ $$=$1; }

Or, put simply, the left-hand-side
inherits the value from the 1st symbol on the right-hand-side
(or inherits the "absence of a value", as the case may be).

This can be a problem if the two symbols have different
types: yacc will complain with an error like

error: type clash (`num' `') on default action

or similar. This means that we should be
fussy about the way we assign types to symbols in yacc, and take the
same care as we would when writing normal C-program code.

In the early stages of development, you can get rid of
these errors by adding an explicit action, like { },
which will override the default action.

The topic of tokens and their associated semantic values
is covered in the section
"Semantic Values"
in the Bison documentation.

Yacc generates a single function called yyparse().
This function takes no parameters, and returns 0
on success or 1 on failure. "Failure" in the case of the
parser means "it encountered a syntax error".

The yyerror() function is called when yacc encounters
invalid syntax. It is passed a single
string (char*) argument. Unfortunately, this string usually
just says "parse error", so on its own, it's
pretty useless. Error recovery is an important topic which
we will cover in more detail later on. For now, we
just want a basic yyerror() function like this:

void yyerror(char *err) {
fprintf(stderr, "%s\n",err);
}

See the section
"Interface"
in the Bison documentation for more information.

This is a prototype parser because, as you may notice, it does not
contain any actions. If you compile it, and run it with a suitable
openwin-menu file, you get exactly: nothing.

But that's OK for now, as
we just want to check that we have understood our input-syntax properly, and
that the parser works as expected.

We'll build the prototype using make. I like to use something like this
Makefile.

As with any compiled source, you never get it right the first time,
so you have to contend with the usual syntax errors which need fixing.
In addition, you may also encounter some
type clashes, which we mentioned above. Once you are over these hurdles,
yacc will generate C-source code that compiles.

A Shift operation is what the parser does when it saves a token
for later use. (Actually, it pushes the token onto a stack)

A Reduce operation is what the parser does when
it resolves a set of tokens into a single, complete rule.
(The corresponding tokens are removed from the stack and replaced
with a single token representing the rule. The stack has been
"reduced").

It is not strictly necessary to eliminate these
warnings, as yacc will still generate an operational parser.
However, it is important to understand these conflicts,
to be sure that the parser we get is the parser we wanted,
and not merely the parser we asked for :-).

Both of these warnings mean that there is an ambiguity
in our ruleset.

Our prototype parser does not contain any such situations.
Refer to the section
"Algorithm"
in the Bison documentation for examples and detailed information.

A reduce/reduce conflict occurs when the same set of tokens
can be used to form two different rules.
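As an illustration (not taken from our menu grammar), consider a grammar where the same single token can complete either of two rules:

```yacc
%token WORD
%%
thing    : word        /* listed first: wins the reduce/reduce conflict */
         | redirect
         ;
word     : WORD ;
redirect : WORD ;      /* can never be reduced */
```

Having seen a WORD, the parser could reduce it to either word or redirect; yacc resolves the ambiguity in favour of the rule that appears first.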

In the event of a reduce/reduce conflict, the parser will use
the first rule that appears in the grammar.
You can think of this as being analogous to lex's "first match" rule.

Reduce/reduce conflicts are usually an indication of an error in the
way the grammar rules have been defined, as the whole point of
having a grammar is to avoid such blatant ambiguities.
It is usually possible (and desirable) to eliminate all
reduce/reduce conflicts from your grammar rules, either by rewriting
some rules, or redefining the grammar (if possible).

See the section
"Reduce/Reduce"
in the Bison documentation for further explanation.

The prototype parser, olmenu-proto1 should be
invoked with the -v option.

olmenu-proto1 -v /usr/openwin/lib/openwin-menu

This will print out a lot of messages about what tokens were
encountered, and which rules they were used in. This information
can be instrumental in determining "where the parser went wrong",
if you feed it an input file which is known to be correct,
but get a "parse error" from the parser.

See the section
"Debugging"
in the Bison documentation for further explanation.

As it stands, when our parser encounters an incorrect
syntax, it will simply print the message "parse error" and exit.
At the very least, we would like an indication of the line-number
at which the error occurred. To do this, we will need the
co-operation of the lexer, since the parser is often "unaware"
of newline characters.
More often than not, newlines are not considered significant
from the point of view of the grammar.

The lexer must set an additional variable, yylloc, every
time it encounters a newline.
This variable must be declared in the lexer like this:

extern YYLTYPE yylloc;

This variable is a structure of type YYLTYPE.
We can use any of its members to store relevant information,
but we are not required to use all of them. One is usually enough.

We should also initialise our line-counter. To this end, we can
use the lex macro YY_USER_INIT in the lex declarations
section:

#define YY_USER_INIT yylloc.first_line=1;

Lex should increment this variable every time it encounters a
newline in the input stream. Be careful of using yyless()
and REJECT in your lex actions, because they can confuse
your line-counter.
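A sketch of such a rule (the pattern and action are illustrative, assuming first_line is the member we chose to use):

```lex
\n      { yylloc.first_line++; return '\n'; }
```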

Once we have set up the lexer to provide line-number information,
we can use it within any yacc action.
We refer to a token's line-number
by using
@n, in the same way that $n is used
to refer to token's value. For example:

{ fprintf(stderr,"Line %d\n",@1.first_line); }

This requires additional code to be generated by yacc.
The appropriate source-code is generated automatically
if you make
use of the @n notation within a yacc action.
(at least, this is true of bison - I'm not sure of
how yacc deals with this).

Once we have coaxed yacc into producing the necessary
code, we can also use yylloc within yyerror(),
as follows:
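A minimal sketch of such a yyerror(), assuming yylloc's first_line member holds the line count (format_err is a hypothetical helper, split out so the message can be inspected without capturing stderr; the YYLTYPE stand-in replaces the declarations y.tab.h would normally supply):

```c
#include <stdio.h>

/* Stand-ins for declarations normally supplied by the generated parser. */
typedef struct { int first_line; } YYLTYPE;
YYLTYPE yylloc;

/* Build the message separately so it can be checked in isolation. */
void format_err(char *buf, size_t len, const char *err) {
    snprintf(buf, len, "line %d: %s", yylloc.first_line, err);
}

void yyerror(char *err) {
    char buf[256];
    format_err(buf, sizeof buf, err);
    fprintf(stderr, "%s\n", buf);
}
```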

As it stands, when our parser encounters an incorrect
syntax, it will simply print the message "parse error" and exit.

We've just added line-number information, but
even this is a bit vague. Also, we may not want our
parser to simply give up at the first syntax error; we would
prefer it to report several errors per invocation, in the same
way the C-compiler does.

Yacc provides a special symbol for handling errors.
The symbol is called error and it should
appear within a grammar-rule. For example, we could have:

If the parser encounters something other than a LABEL
after a '<', it will discard all tokens up to the
next '>'. This technique can be useful for keeping
brackets balanced during error-recovery.
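Such a rule might look like this (a sketch, using the '<' and '>' tokens described above):

```yacc
label : '<' LABEL '>'
      | '<' error '>'    /* on a syntax error, discard tokens up to '>' */
      ;
```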

In our case, keep in mind that any unidentifiable text after a valid
LABEL
is converted to an EXEC token by the lexer, and the EXEC token would swallow
any subsequent '>'.
So this rule is unlikely to trap any real errors,
except something like < keyword >.

In fact, it is likely to work against us, because we may not see another
'>' at all, so we will miss out on a lot of input!

Error recovery is a tricky business, and we don't always get the
results we really wanted. It doesn't pay to be too fussy about
this aspect of the parser.

See the sections
"Error Recovery"
and
"Action Features"
in the Bison documentation for further explanations and examples.

Up till now, we have been concentrating purely on the
grammar rules.
This has been quite deliberate, because we should get
the grammar right first, before we spend too much time writing
(and re-writing) our actions.

The basic goal of our parser will be to create an in-memory
representation of the input data. Or, in other words, a set of
structures, linked lists, arrays, etc, which we can
use and manipulate.
Once we have our memory-representation, we can
generate our new menu-file in the CDE format, using
conventional techniques and statements such as printf.

We are free to use any technique
we like to build up our memory structures, however
yacc lends itself particularly well to
a specific style of doing so.

Yacc gives us the ability to assign values to non-terminal symbols,
as if they were yylval values supplied by the lexer.
This allows us to pass values and pointers along through the grammar rules
themselves. We can use this feature to build up linked-lists
and other structures "on the fly".
We looked at this technique once before, in the section
titled
"Using Token Types in Yacc Actions",
where we passed along an integer value representing the
presence or absence of the keyword DEFAULT.

If we allow a menu-item to include a pointer
to another menu, then the tree-like nature of
the menus will be catered for, so
we do not need to keep a separate structure to define
the menu-tree.

Our target format, the dtwmrc format, does not use
"inline" submenus like olvwm. Instead, each menu is referred
to by its name, and its contents are defined after the
end of the current menu. For example:

The first thing we are likely to encounter in our
menu-file is a menu item. We can tell if it's the first
menu-item, because it will be processed by the rule:

menu_items : menu_item ;

and not

menu_items : menu_items menu_item ;

This makes the first menu_item rule a
good place to allocate our struct menu.
We do this mainly so we can get access to the variables
first_item and last_item

We can "hold onto" our struct menu without
resorting to global variables by circulating it within
the menu_items rule. So both alternatives of
the rule return a struct menu * (though only
the first version allocates it).

When there are no more menu items in the current
menu or sub-menu, the rule for menu_items is complete.
We then store the
struct menu * pointer returned by menu_items:

In the menufile rule, it is stored
in the global variable top_menu.

In the submenu rule, it is stored
in the struct menu_item which invokes
the menu.

The only other place we would want to call new_menu() is
from within the rule for submenu.

submenu : label default MENU '\n' menu_items end '\n'
;

This rule contains the symbol menu_items,
which allocates a struct menu and returns it
complete with a linked list of menu-items.
So there's not much else to do, other than to create an
item using new_item() , and use it to store
the struct menu pointer
which menu_items returns for us.

Our rule for menu_items works properly,
and returns the right values in all cases. However,
one peculiarity does arise from the way this rule
interacts with the submenu rule.

If our input contains a submenu before the first ordinary
item, we call new_menu() for the submenu before
the parent-menu. This is not strictly a problem, but
it would be nicer to have the menu_list in the
order-of-appearance. We can fix this by re-writing
the rule for menu_items as:

Lastly, there is the issue of the options
rule. These items are not really true menu items at all, just
some options we can set, like title and columns.
These have been defined as members within struct menu.
However, at the time we are parsing the options, we do not have
access to the correct struct menu.
We could use a global-variable to store the current menu,
but it might get tricky to
restore the correct menu when we get to the end of a submenu.

It is easier to let the parser do the work, and pass down
values in the manner to which we are now accustomed.

We will use the struct menuitem as a vehicle to
transfer the TITLE and COLUMNS
options. We will then make the add_item() function
treat these items as "special", and free the structure when done,

We are not going to use mid-rule actions, but I'll mention
them anyway, because they can be useful.

Consider the problem of the rule: submenu.

Let's say we defined the rule for menu_items, such that
it returned the start of a linked-list of menu-items,
instead of a struct menu pointer.
We would then link the struct menu_item list
to the struct menu in actions for the rules
menufile and submenu.
The problem then arises: how do we build the linked-list?
We could:

Build the list backwards, and prepend each new
menu-item to the start of the item-list

menu_items : menu_items menu_item
{ $2->next = $1; $$ = $2; }
;

Build it forwards, by traversing the list to
the end for each new menu-item, and appending to the end of
the list.
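The two approaches can be sketched in plain C (struct item, prepend and append are illustrative names; the real menu-item struct has more fields):

```c
#include <stddef.h>

/* A minimal menu-item node. */
struct item {
    const char *label;
    struct item *next;
};

/* Option 1: prepend each new item. Cheap, but the list ends up backwards. */
struct item *prepend(struct item *head, struct item *it) {
    it->next = head;
    return it;
}

/* Option 2: build forwards, traversing to the end for each new item. */
struct item *append(struct item *head, struct item *it) {
    struct item *p;
    it->next = NULL;
    if (!head)
        return it;
    for (p = head; p->next; p = p->next)
        ;                       /* walk to the current tail */
    p->next = it;
    return head;
}
```

Keeping a pointer to the tail would avoid the walk in append(), which is exactly where the last_item discussion below comes in.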

Clearly, building the list forwards is nicer, and we can avoid
the repeated traversal by keeping a last_item pointer. However,
we cannot use a simple global variable as our last_item pointer.
If we did, we would get to end of the first submenu, and we would want
to restore last_item to the last item of the parent
menu. But we've already lost this pointer,
because we've been using the same global-variable last_item
to process the submenu.

So we need to put last_item somewhere other than a global
variable. Our struct menu does nicely. Except that our existing
rule for a submenu will read all the menu_items before
the action for the submenu rule is executed (remembering what happened
when processing our keyword DEFAULT ).

submenu : label default MENU '\n' menu_items end '\n'
;

There is a way we can allocate our struct menu
before we start processing menu_items, and that is
to use a mid-rule action, like this:
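A sketch of what such a mid-rule action might look like (`menuptr` stands for a hypothetical %union member holding a struct menu pointer):

```yacc
submenu : label default MENU '\n'
            { $<menuptr>$ = new_menu(); }    /* mid-rule action */
          menu_items end '\n'
            {
              /* the mid-rule action counts as a component, so its value
                 is available here as $<menuptr>5; menu_items is now $6 */
            }
        ;
```

The mid-rule action runs as soon as the parser has seen the MENU keyword and its newline, so the struct menu exists before any menu_items are processed.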

In order to generate a complete dtwmrc file, we should
really be reading the files nominated in the rule menu_file.
It is technically feasible to open a submenu file as soon
as we parse the menu_file line.
Due to the use of "look-ahead" in both the lexer and parser, it is
not simply a matter of changing yyin.

Flex provides a set of functions and macros to handle this,
and these are described in the section "Multiple Input Buffers"
in the flex man page.

In addition, we should be aware that the parser may also
be one token in front, since it uses a "look-ahead" token
to decide which rule to apply.

So, while it is possible to change input-streams on-the-fly,
it adds another dimension of complexity to our program.

In our case, a much simpler solution is to read a whole
file at a time, calling yyparse() once per file.
We stitch the new memory-structures in manually.
The function read_menu_files() does just this. It:

traverses our memory structure looking
for instances of menu-files,

opens the file, and assigns it to yyin

calls yyparse()

links in the new top-level menu at the appropriate point.

As luck would have it, menus which are read from other files are
just appended to the end of our menu_list, so
we don't need to do anything else to our menu_list.

We process the menus in the order they appear in menu_list,
and this list can be appended-to while we are processing the current
menu. Hence, this simple solution also caters for files nested
several levels deep.

See the section
"Pure Calling"
in the Bison documentation for further information.

Pure calling is not required in our case, as we are not calling the
parser from within the parser. We are calling yyparse() repeatedly, but we
allow it to complete before calling it again.

The openwin-menu file syntax also allows the filenames of nested
menu files to be referred to using environment variables. For example,
/usr/openwin/lib/openwin-menu-games

could be written

$OPENWINHOME/lib/openwin-menu-games

or even

${OPENWINHOME}/lib/openwin-menu-games

Traditionally, this variable expansion would be done using
C-library calls like strspn()
and strtok(), or maybe using a page or so of hand-written
code.

However, this seems quite tedious after what we've been doing with
lex. After all, this kind of thing is what lex is best at.

The GNU lexer, flex, provides several functions to do exactly
what we want: scan a string variable, just like it would a text-file.

The details of the flex functions required to scan a string
are described in the flex man page.

Our lexer, olmenu.l
contains a function expand_env() which uses this feature of
lex. It takes the filename string which was parsed earlier,
and processes it, using lex to break the string into several
components.
If the component is an environment variable, the lexer returns
the result of the corresponding getenv(). If the
component isn't an environment variable, it is returned verbatim.

The function expand_env() calls yylex()
repeatedly, and concatenates the components into the desired result string.

In order that our string-scanner does not interfere with
our regular text-scanner, we will set an "exclusive" start condition,
ENV, before we call yylex().

We need to remember that yylex() returns 0 when it
reaches the end-of-input, but we can return any other integer values
we like. In this case, we'll return yytext, or the result of
a getenv(yytext). Keep in mind that getenv()
returns 0 if the variable doesn't exist, so we have to be careful not
to return the getenv() result without checking it.
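The check itself can be sketched in plain C (expand_component is a hypothetical helper name, handling one component of the filename):

```c
#include <stdlib.h>

/* Expand one component: the value of the environment variable `name`
 * if it is set, otherwise the component text verbatim.  getenv()
 * returns NULL for unset variables, so we must check before using it. */
const char *expand_component(const char *name, const char *verbatim) {
    const char *val = getenv(name);
    return val ? val : verbatim;
}
```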

The really good thing about using flex to do this, is the
amount of control we have over the resulting string-scanner.
For example, if we wanted to allow backslash quoting to suppress the
environment-variable substitution,
we could change the rule:

Make contains implicit rules for building programs from yacc source files.
The source file must end in .y for the implicit rules to work.

The default rule for yacc is:

.y.c :
$(YACC) $(YFLAGS) $<
mv -f y.tab.c $@

I like to call my parser files something.l
and something.y. Unfortunately, this introduces a conflict in
make's default rules, which would try to generate something.c
from both the lex and yacc source. Hence, I like to "brew my own"
implicit rules, which generate something_lex.c and
something_yacc.c instead.
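Such home-brewed rules might look like this (a sketch using GNU make pattern rules; it assumes yacc, or bison invoked with -y, so that the output files are named y.tab.c and y.tab.h):

```make
%_yacc.c %_yacc.h : %.y
	$(YACC) $(YFLAGS) -d $<
	mv -f y.tab.c $*_yacc.c
	mv -f y.tab.h $*_yacc.h

%_lex.c : %.l
	$(LEX) $(LFLAGS) -t $< > $@
```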

Note that there are some incompatibilities at this level between
yacc and bison. Yacc likes to call its output files y.tab.c
for the parser and y.tab.h for the token definitions.
Bison prefers to use basename.tab.c and
basename.tab.h (respectively).
Bison will generate y.tab.c
and y.tab.h if it is invoked with the -y flag.

info2www

This document contains numerous hyper-references to the Bison
documentation. In order to use these, you must install the
perl script info2www on a web-server on your local machine.

The Bison documentation is in "Info" format, and the info2www
gateway is arguably the most convenient way of accessing this
type of documentation.