6. Using the BNF converter to make bovine tables

The BNF converter takes a file in "Bovine Normal Form" which is similar
to "Backus-Naur Form". If you have ever used yacc or bison, you will
find it similar. The BNF form used by semantic, however, does not
include token precedence rules, and several other features needed to make
real parser generators.

It is important to have an Emacs Lisp file with a variable ready to take
the output of your table (see section 5. Preparing a bovine table for your language). Also, make sure that the
file `semantic-bnf.el' is loaded. Give your language file the
extension `.bnf' and you are ready.

The comment character is #.

When you want to test your file, use the keyboard shortcut C-c C-c
to parse the file, generate the variable, and load the new definition
in. It will then use the settings specified above to determine what to
do. Use the shortcut C-c c to do the same thing, but spend
extra time indenting the table nicely.

Make sure that you create the variable specified in the
%parsetable token before trying to convert the BNF file. A
simple definition like this is sufficient.

(defvar semantic-toplevel-lang-bovine-table
  nil
  "Table for use with semantic for parsing LANG.")

If you use tokens (created with the %token specifier), also
make sure you have a keyword table available, like this:

(defvar semantic-lang-keyword-table
  nil
  "Table for use with semantic for keywords.")

Specify the name of the keyword table with the %keywordtable
specifier.

The BNF file has two sections. The first is the settings section, and
the second is the language definition, or list of semantic rules.

6.1 Settings

A setting is a keyword starting with a %. (This syntax is taken
from yacc and bison; see the Bison manual.)

There are several settings that can be made in the settings section.
They are:

Setting: %start <nonterminal>

Specify an alternative to bovine-toplevel. (See below)

Setting: %scopestart <nonterminal>

Specify an alternative to bovine-inner-scope.

Setting: %outputfile <filename>

Required. Specifies the file into which this file's output is stored.

Setting: %parsetable <lisp-variable-name>

Required. Specifies a lisp variable into which the output is stored.

Setting: %setupfunction <lisp-function-name>

Required. Name of a function into which setup code is to be inserted.

Setting: %keywordtable <lisp-variable-name>

Required if there are %token keywords.
Specifies a lisp variable into which the output of a keyword table is
stored. This obarray is used to turn symbols into keywords when applicable.

Setting: %token <name> "<text>"

Optional. Specify a new token NAME. This is added to a lexical
keyword list using TEXT. The symbol is then converted into a new
lexical terminal. This requires that the variable specified by
%keywordtable be available in the file specified by %outputfile.

Setting: %token <name> type "<text>"

Optional. Specify a new token NAME. It is made from an existing
lexical token of type TYPE. TEXT is a string which will be
matched explicitly. NAME can be used in match rules as though it were
a flex token, but it is converted back to TYPE "text" internally.

Setting: %( <lisp-expression> )%

Specify setup code to be inserted into the %setupfunction.
It will be inserted between two specifier strings, or added to
the end of the function.

When working inside %( ... )% tokens, any lisp expression can be
entered which will be placed inside the setup function. In general, you
probably want to set variables that tell Semantic and related tools how
the language works.

Here are some variables that control how different programs will work
with your language.

Variable: semantic-flex-depth

Default flexing depth.
This specifies how many lists to create tokens in.

Variable: semantic-number-expression

Regular expression for matching a number.
If this value is nil, no number extraction is done during lex.
Symbols which match this expression are returned as number
tokens instead of symbol tokens.

The default value for this variable should work in most languages.

Variable: semantic-flex-extensions

Buffer local extensions to the lexical analyzer.
This should contain an alist with a key of a regex and a data element of
a function. The function should both move point, and return a lexical
token of the form:

( TYPE START . END)

nil is also a valid return.
TYPE can be any type of symbol, as long as it doesn't occur as a
nonterminal in the language definition.
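
A minimal sketch of one such extension follows. The function name
my-lang-flex-shebang, the regular expression, and the token type
shebang are hypothetical; only the alist shape and the
(TYPE START . END) return form come from the description above.

```elisp
;; Hypothetical extension: treat a "#!..." line as one lexical token.
(defun my-lang-flex-shebang ()
  "Move point past a #! line and return a (TYPE START . END) token, or nil."
  (when (looking-at "#!.*$")
    (let ((start (match-beginning 0))
          (end (match-end 0)))
      (goto-char end)                     ; the extension must move point
      (cons 'shebang (cons start end))))) ; i.e. (shebang START . END)

;; Alist of (REGEX . FUNCTION), the shape described for
;; semantic-flex-extensions.  The variable name is an assumption.
(defvar my-lang-flex-extensions
  '(("^#!" . my-lang-flex-shebang))
  "Hypothetical value for `semantic-flex-extensions'.")
```

Inside the setup code, such a value would presumably be assigned to
semantic-flex-extensions for the buffer being parsed.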

Variable: semantic-flex-syntax-modifications

Updates to the syntax table for this buffer.
These changes are active only while this file is being flexed.
This is a list where each element is of the form:

(CHAR CLASS)

Where CHAR is the char passed to modify-syntax-entry,
and CLASS is the string also passed to modify-syntax-entry to define
what class of syntax CHAR is.
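
As an illustration, the sketch below builds such a list and applies it
to a copy of a syntax table, which is roughly what activating the
changes only during flexing amounts to. The names
my-lang-flex-syntax-modifications and
my-lang-apply-syntax-modifications are hypothetical, and the helper is
not Semantic's own code.

```elisp
;; Hypothetical modifications: make ' punctuation and _ a word constituent.
(defvar my-lang-flex-syntax-modifications
  '((?\' ".")
    (?_ "w"))
  "Hypothetical value for `semantic-flex-syntax-modifications'.")

(defun my-lang-apply-syntax-modifications (mods table)
  "Return a copy of syntax TABLE with each (CHAR CLASS) in MODS applied."
  (let ((new (copy-syntax-table table)))
    (dolist (entry mods new)
      ;; CHAR and CLASS are passed straight to `modify-syntax-entry'.
      (modify-syntax-entry (car entry) (cadr entry) new))))
```

The original table is left untouched, so the changes disappear as soon
as the copy is no longer in use.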

Variable: semantic-flex-enable-newlines

When flexing, report 'newlines as syntactic elements.
Useful for languages where the newline is a special case terminator.
Only set this on a per mode basis, not globally.

Variable: semantic-ignore-comments

Default comment handling.
t means to strip comments when flexing. Nil means to keep comments
as part of the token stream.

Variable: semantic-symbol->name-assoc-list

Association between symbols returned, and a string.
The string is used to represent a group of objects of the given type.
It is sometimes useful for a language to use a different string
in place of the default, even though that language will still
return a symbol. For example, Java returns includes, but the
string can be replaced with Imports.

Variable: semantic-case-fold

Value for case-fold-search when parsing.

Variable: semantic-expand-nonterminal

Function to call for each nonterminal production.
Return a list of non-terminals derived from the first argument, or nil
if it does not need to be expanded.
Languages with compound definitions should use this function to expand
from one compound symbol into several. For example, in C the
definition

int a, b;

is easily parsed into one token, but represents multiple variables. A
function should be written which takes this compound token and turns
it into two tokens, one for A, and the other for B.

Within the language definition (the `.bnf' sources), it is often
useful to set the NAME slot of a token with a list of items that
distinguish each element in the compound definition.

This list can then be detected by the function set in
semantic-expand-nonterminal to create multiple tokens.
This function has one additional duty of managing the overlays created
by semantic. It is possible to use the single overlay in the compound
token for all your tokens, but this can pose problems identifying
all tokens covering a given definition.

Please see `semantic-java.el' for an example of managing overlays
when expanding a token into multiple definitions.
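
The expansion step itself can be sketched as follows, assuming the
language definition stored a list of names in the NAME slot. The
function name my-lang-expand-nonterminal is hypothetical, and this
sketch deliberately skips the overlay management described above: every
expanded token shares the positional slot of the compound token.

```elisp
(defun my-lang-expand-nonterminal (token)
  "Expand TOKEN into one token per name if its NAME slot is a list.
Return nil when TOKEN does not need to be expanded."
  (let ((names (car token)))
    (when (consp names)
      (mapcar (lambda (name)
                ;; Copy the remaining slots so each new token can be
                ;; modified independently of the others.
                (cons name (copy-sequence (cdr token))))
              names))))
```

For the C definition shown earlier, a compound token whose NAME slot is
("a" "b") would become two variable tokens, one named "a" and one named
"b".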

Variable: semantic-override-table

Buffer local semantic function overrides alist.
These overrides provide a hook for a `major-mode' to override specific
behaviors with respect to generated semantic toplevel nonterminals and
things that these non-terminals are useful for.
Each element must be of the form: (SYM . FUN)
where SYM is the symbol to override, and FUN is the function to
override it with.

Available override symbols:

SYMBOL                                PARAMETERS            DESCRIPTION
find-dependency                       (token)               Find the dependency file.
find-nonterminal                      (token & parent)      Find token in buffer.
find-documentation                    (token & nosnarf)     Find doc comments.
abbreviate-nonterminal                (token & parent)      Return summary string.
summarize-nonterminal                 (token & parent)      Return summary string.
prototype-nonterminal                 (token)               Return a prototype string.
concise-prototype-nonterminal         (tok & parent color)  Return a concise prototype string.
uml-abbreviate-nonterminal            (tok & parent color)  Return a UML standard abbreviation string.
uml-prototype-nonterminal             (tok & parent color)  Return a UML like prototype string.
uml-concise-prototype-nonterminal     (tok & parent color)  Return a UML like concise prototype string.
prototype-file                        (buffer)              Return a file in which prototypes are placed.
nonterminal-children                  (token)               Return first rate children. These are children
                                                            which may contain overlays.
nonterminal-external-member-parent    (token)               Parent of TOKEN.
nonterminal-external-member-p         (parent token)        Non nil if TOKEN has PARENT, but is not in PARENT.
nonterminal-external-member-children  (token & usedb)       Get all external children of TOKEN.
nonterminal-protection                (token & parent)      Return protection as a symbol.
nonterminal-abstract                  (token & parent)      Return if TOKEN is abstract.
nonterminal-leaf                      (token & parent)      Return if TOKEN is leaf.
nonterminal-static                    (token & parent)      Return if TOKEN is static.
beginning-of-context                  (& point)             Move to the beginning of the current context.
end-of-context                        (& point)             Move to the end of the current context.
up-context                            (& point)             Move up one context level.
get-local-variables                   (& point)             Get local variables.
get-all-local-variables               (& point)             Get all local variables.
get-local-arguments                   (& point)             Get arguments to this function.
end-of-command                                              Move to the end of the current command.
beginning-of-command                                        Move to the beginning of the current command.
ctxt-current-symbol                   (& point)             List of related symbols.
ctxt-current-assignment               (& point)             Variable being assigned to.
ctxt-current-function                 (& point)             Function being called at point.
ctxt-current-argument                 (& point)             The index to the argument of the current
                                                            function the cursor is in.

Parameters mean:

&

Following parameters are optional

buffer

The buffer in which a token was found.

token

The nonterminal token we are doing stuff with

parent

If a TOKEN is stripped (of positional information) then this will be the
parent token which should have positional information in it.
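
A small sketch of an override table with a single entry is shown below.
The names my-lang-prototype-nonterminal, my-lang-override-table, and
my-lang-call-override are hypothetical, and the dispatcher only
illustrates how a (SYM . FUN) entry can be looked up; Semantic performs
its own lookup.

```elisp
(defun my-lang-prototype-nonterminal (token)
  "Return a prototype string for TOKEN, a (\"NAME\" type-symbol ...) list."
  (format "%s %s" (nth 1 token) (car token)))

;; Each element has the form (SYM . FUN).
(defvar my-lang-override-table
  '((prototype-nonterminal . my-lang-prototype-nonterminal))
  "Hypothetical value for `semantic-override-table'.")

(defun my-lang-call-override (sym table &rest args)
  "Call the override for SYM in TABLE with ARGS, or return nil if none."
  (let ((fun (cdr (assq sym table))))
    (when fun
      (apply fun args))))
```

Symbols without an entry in the table fall through to nil here, which
stands in for whatever default behavior applies.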

Variable: semantic-type-relation-separator-character

Character strings used to separate a parent/child relationship.
This list of strings is used for displaying or finding separators
in variable field dereferencing. The first string will be used for
display. In C, a type field is separated like this: "type.field"
thus, the character is a ".". In C, an additional value of "->"
would be in the list, so that "type->field" could be found.

Variable: semantic-dependency-include-path

Defines the include path used when searching for files.
This should be a list of directories to search which is specific to
the file being included.
This variable can also be set to a single function. If it is a
function, it will be called with one argument, the file to find as a
string, and it should return the full path to that file, or nil.
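
The function form can be sketched with the standard locate-file
function. The directory list and the names
my-lang-include-directories and my-lang-find-include are assumptions
made for the example.

```elisp
(defvar my-lang-include-directories '("/usr/include" "/usr/local/include")
  "Hypothetical list of directories searched for included files.")

(defun my-lang-find-include (file)
  "Return the full path to FILE in the include directories, or nil."
  ;; `locate-file' returns the absolute file name of the first match,
  ;; or nil when FILE is not found in any directory.
  (locate-file file my-lang-include-directories))
```

The variable could then be set to the symbol my-lang-find-include
instead of a list of directories.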

This configures Imenu to use semantic parsing.

Variable: imenu-create-index-function

The function to use for creating a buffer index.

It should be a function that takes no arguments and returns an index
of the current buffer as an alist.

Simple elements in the alist look like `(INDEX-NAME . INDEX-POSITION)'.
Special elements look like `(INDEX-NAME INDEX-POSITION FUNCTION ARGUMENTS...)'.
A nested sub-alist element looks like (INDEX-NAME SUB-ALIST).
The function imenu--subalist-p tests an element and returns t
if it is a sub-alist.

6.2 Rules

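Each rule in the language definition has the general form:

RESULT : MATCH1 (optional-lambda-expression)
       | MATCH2 (optional-lambda-expression)
       ;

RESULT is a non-terminal, or a token synthesized in your grammar.
MATCH is a list of elements that are to be matched if RESULT
is to be made. The optional lambda expression is a list containing
simplified rules for concocting the parse tree.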

In bison, each time an element of a MATCH is found, it is
"shifted" onto the parser stack. (The stack of matched elements.) When
all of MATCH1's elements have been matched, it is "reduced" to
RESULT. See the Algorithm section of the Bison manual.

The first RESULT written into your language specification should
be bovine-toplevel, or the symbol specified with %start.
When starting a parse for a file, this is the default token iterated
over. You can use any token you want in place of bovine-toplevel
if you specify what that nonterminal will be with a %start token
in the settings section.

MATCH is made up of symbols and strings. A symbol such as
foo means that a syntactic token of type foo must be
matched. A string in the mix means that the previous symbol must have
the additional constraint of exactly matching it. Thus, the
combination:

symbol "moose"

means that a symbol must first be encountered, and then it must
string-match "moose". Be especially careful to remember that the
string is a regular expression. The code:

punctuation "."

will match any punctuation.
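
The difference can be checked directly in Emacs Lisp. The helper
my-lang-exact-regexp is hypothetical and only shows one way to build a
pattern that accepts a single literal string.

```elisp
;; "moose" as a regular expression matches anywhere inside the text.
(string-match "moose" "moosehead")            ; non-nil: unanchored match
;; "." matches any single character, so it accepts any punctuation.
(string-match "." ";")                        ; non-nil

(defun my-lang-exact-regexp (text)
  "Return a regexp matching exactly the literal string TEXT."
  (concat "\\`" (regexp-quote text) "\\'"))

(string-match (my-lang-exact-regexp ".") ";") ; nil
(string-match (my-lang-exact-regexp ".") ".") ; non-nil
```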

For the above example in bison, a LEX rule would be used to create a new
token MOOSE. In this case, the MOOSE token would appear.
For the bovinator, this task was mixed into the language definition to
simplify implementation, though Bison's technique is more efficient.

To make a symbol match explicitly for keywords, for example, you can use
the %token command in the settings section to create new symbols.

%token MOOSE "moose"
find_a_moose: MOOSE
;

will match "moose" explicitly, unlike the previous example where moose
need only appear in the symbol. This is because "moose" will be
converted to MOOSE in the lexical analysis stage. Thus the symbol
MOOSE won't be available any other way.

If we specify our token in this way:

%token MOOSE symbol "moose"
find_a_moose: MOOSE
;

then MOOSE will match the string "moose" explicitly, but it won't
do so at the lexical level, allowing use of the text "moose" in other
forms of regular expressions.

6.3 Optional Lambda Expressions

The OLE (Optional Lambda Expression) is converted into a bovine lambda
(see section 5. Preparing a bovine table for your language). This lambda has special short-cuts to simplify
reading the Emacs BNF definition. An OLE like this:

( $1 )

results in a lambda return which consists entirely of the string
or object found by matching the first (zeroth) element of match.
An OLE like this:

( ,(foo $1) )

executes `foo' on the first argument, and then splices its return
into the return list whereas:

( (foo $1) )

executes foo, and that is placed in the return list.

Here are other things that can appear inline:

$1

the first object matched.

,$1

the first object spliced into the list (assuming it is a list from a
non-terminal)

'$1

the first object matched, placed in a list. i.e. ( $1 )

foo

the symbol foo (exactly as displayed)

(foo)

a function call to foo which is stuck into the return list.

,(foo)

a function call to foo which is spliced into the return list.

'(foo)

a function call to foo which is stuck into the return list in a list.

(EXPAND $1 nonterminal depth)

a list starting with EXPAND performs a recursive parse on the token
passed to it (represented by $1 above.) The semantic list is a common
token to expand, as there are often interesting things in the list.
The nonterminal is a symbol in your table which the bovinator will
start with when parsing. nonterminal's definition is the same as
any other nonterminal. depth should be at least 1 when
descending into a semantic list.

(EXPANDFULL $1 nonterminal depth)

is like EXPAND, except that the parser will iterate over
nonterminal until there are no more matches. (The same way the
parser iterates over bovine-toplevel.) This lets you have
much simpler rules in this specific case, and also lets you have
positional information in the returned tokens, and error skipping.

(ASSOC symbol1 value1 symbol2 value2 ... )

This is used for creating an association list. Each SYMBOL is
included in the list if the associated VALUE is non-nil. While
the items are all listed explicitly, the created structure is an
association list of the form:

( ( symbol1 . value1) (symbol2 . value2) ... )
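
The behavior described for ASSOC can be imitated with an ordinary
function. The name my-lang-assoc is hypothetical, and this is a sketch
of the documented behavior rather than the converter's implementation.

```elisp
(defun my-lang-assoc (&rest args)
  "Build an alist from ARGS, given as alternating SYMBOL VALUE items.
Pairs whose VALUE is nil are left out of the result."
  (let (result)
    (while args
      (let ((sym (car args))
            (val (cadr args)))
        (when val
          (push (cons sym val) result)))
      (setq args (cddr args)))
    (nreverse result)))
```

Calling it with const t pointer nil would yield an alist holding only
the const entry, since the pointer value is nil.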

If the symbol %quotemode backquote is specified, then use
,@ to splice a list in, and , to evaluate the expression.
This lets you send $1 as a symbol into a list instead of having
it expanded inline.
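
As an analogy in plain Emacs Lisp, the sketch below binds $1 by hand
and shows the three cases with the ordinary backquote. The function
name my-lang-backquote-demo is hypothetical, and the converter's own
handling of %quotemode backquote may differ in detail.

```elisp
(defun my-lang-backquote-demo ()
  "Show the three ways $1 can appear inside a backquoted list."
  (let (($1 '("int" "char")))
    (list `(types ,$1)     ; value inserted as a single element
          `(types ,@$1)    ; elements spliced into the list
          `(types $1))))   ; left alone: the symbol $1 itself
```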

6.5 Semantic Token Style Guide

In order for a generalized program using Semantic to work with
multiple languages, it is important to have a consistent meaning for
the contents of the tokens returned. The variable
semantic-toplevel-bovine-table is documented with the complete
list of tokens that a functional or OO language may use. While any
given language is free to create its own tokens, such a language
definition would not produce a stream of tokens usable by a
generalized tool.

6.6 Minimum Requirements

In general, all tokens returned from a parser should be generated with
the following form:

("NAME" type-symbol ... "DOCSTRING" PROPERTIES OVERLAY)

NAME and type-symbol are the only syntactic elements of a
nonterminal which are guaranteed to exist. This means that a parser
which uses nil for either of these two slots, or some value
which is not type consistent is wrong.

NAME is also guaranteed to be a string. This string represents
the name of the nonterminal, usually a named definition which the
language will use elsewhere as a reference to the syntactic element
found.

type-symbol is a symbol representing the type of the
nonterminal. A valid type-symbol can be anything, as long as it
is an Emacs Lisp symbol.

DOCSTRING is a required slot in the nonterminal, but can be
nil. Some languages have the documentation saved as a comment
nearby. In these cases, DOCSTRING is nil, and the function
`semantic-find-documentation' is used to find the documentation instead.

PROPERTIES is a slot generated by the semantic parser harness,
and need not be provided by a language author. Access nonterminal
properties programmatically with semantic-token-put and
semantic-token-get.

OVERLAY represents positional information for this token. It is
automatically generated by the semantic parser harness, and need not
be provided by the language author, unless they provide a nonterminal
expansion function via semantic-expand-nonterminal.

The OVERLAY property is accessed via several functions returning
the beginning, end, and buffer of a token. Use these functions unless
the overlay is really needed (see 9.1 Token Queries). Depending on the
overlay in a program can be dangerous because sometimes the overlay is
replaced with an integer pair

[ START END ]

when the buffer the token belongs to is not in memory. This happens
when a user has activated the Semantic Database (see 11.3 Semantic Database).
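
Code that must read the positional slot directly has to accept both
representations. The sketch below shows what that involves; the names
my-lang-position-start and my-lang-position-end are hypothetical, and
the token query functions remain the preferred interface.

```elisp
(defun my-lang-position-start (position)
  "Return the start of POSITION, an overlay or a [START END] vector."
  (if (overlayp position)
      (overlay-start position)
    (aref position 0)))

(defun my-lang-position-end (position)
  "Return the end of POSITION, an overlay or a [START END] vector."
  (if (overlayp position)
      (overlay-end position)
    (aref position 1)))
```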

6.7 Nonterminals for Functional Languages

If a parser produces tokens for a functional language, then the
following token formats are available.

Variable

("NAME" variable "TYPE" DEFAULT-VALUE EXTRA-SPEC
 "DOCSTRING" PROPERTIES OVERLAY)

TYPE is a string representing the type of this variable.
TYPE can be nil for untyped languages. Languages which
support variable declarations without a type (Such as C) should supply
a string representing the default type for that language.

DEFAULT-VALUE can be a string, or something pre-parsed and
language specific. Hopefully this slot will be better defined in
future versions of Semantic.

EXTRA-SPEC are extra specifiers. See below.

Function

("NAME" function "TYPE" ( ARG-LIST ) EXTRA-SPEC
 "DOCSTRING" PROPERTIES OVERLAY)

TYPE is a string representing the return type of this function
or method. TYPE can be nil for untyped languages, or for
procedures in languages which support functions with no return data.
See above for more.

ARG-LIST is a list of arguments passed to this function.
Each element in the arg list can be one of the following:

Semantic Token

A full semantic token with positional information.

A partial semantic token

Partial tokens may contain the NAME slot, token-symbol,
and possibly a TYPE.

String

A string representing the name of the argument. Common in untyped
languages.

Type Declaration

("NAME" type "TYPE" ( PART-LIST ) ( PARENTS ) EXTRA-SPEC
 "DOCSTRING" PROPERTIES OVERLAY)

TYPE is a string representing the type of the type, such as (in C)
"struct", "union", "enum", "typedef", or "class".
The TYPE for a type token should not be nil, as even untyped
languages with structures have type types.

PART-LIST is the list of individual entries inside compound
types. Structures, for example, can contain several fields which can
be represented as variables. Valid entries in a PART-LIST are:

Semantic Token

A full semantic token with positional information.

A partial semantic token

Partial tokens may contain the NAME slot, token-symbol,
and possibly a TYPE.

String

A string representing the name of the slot or field. Common in untyped
languages.

PARENTS represents a list of parents of this type. Parents are used
in two situations.

Inheritance

For types which inherit from other types of the same type-type (Such
as classes).

Aliases

For types which are aliases of other types, the parent type is the
type being aliased. The alias type's TYPE is the command specifying that it
is an alias (such as "typedef" in C or C++).

The structure of the PARENTS list is of this form:

( EXPLICIT-PARENTS . INTERFACE-PARENTS)

EXPLICIT-PARENTS can be a single string (Just one parent) or a
list of parents (in a multiple inheritance situation). It can also be
nil.

INTERFACE-PARENTS is a list of strings representing the names of
all INTERFACES, or abstract classes inherited from. It can also be
nil.

This slot can be interesting because the form:

( nil "string")

is a valid parent where there is no explicit parent, and only an
interface.
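
Because EXPLICIT-PARENTS may be a string, a list, or nil, code reading
this slot usually normalizes it first. A minimal sketch, with
hypothetical function names:

```elisp
(defun my-lang-explicit-parents (parents)
  "Return the explicit parents in PARENTS as a list of strings."
  (let ((explicit (car parents)))
    ;; A single parent may be stored as a bare string.
    (if (stringp explicit)
        (list explicit)
      explicit)))

(defun my-lang-interface-parents (parents)
  "Return the interface parents in PARENTS as a list of strings."
  (cdr parents))
```

With the form ( nil "string") shown above, the first function returns
nil and the second returns a list containing "string".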

Include files

("FILE" include SYSTEM "DOCSTRING" PROPERTIES OVERLAY)

A statement which gets additional definitions from outside the current
file, such as an #include statement in C.
In this case, instead of NAME, a FILE is specified.
FILE can be a subset of the actual file to be loaded.

SYSTEM is true if this include is part of a set of system
includes. This field isn't currently being used and may be
eliminated.

Package & Provide statements

("NAME" package DETAIL "DOCSTRING" PROPERTIES OVERLAY)

A statement which declares a given file is part of a package, such as
the Java package statement, or a provide in Emacs Lisp.

DETAIL might be an associated file name, or some other language
specific bit of information.

6.8 Extra Specifiers

Some default token types have a slot EXTRA-SPEC, for extra specifiers.
These specifiers provide additional details not commonly used, or not
available in all languages. This list is an alist, and if a given key
is nil, it is not in the list, saving space. Some valid extra
specifiers are:

(parent . "text")

Name of a parent type/class. This is not the same as a parent for a
type. C++ and CLOS allow the creation of a function outside the
body of its class. Such functions will set the parent
specifier to a plain text string which is the name of that parent.

(dereference . INT)

Number of levels of dereference.
In C, the number of array dimensions.

(pointer . INT)

Number of levels of pointers.
In C, the number of * characters.

(typemodifiers . ( "text" ... ))

Keyword modifiers for a type. In C, such words would include
register and volatile.

(suffix . "text")

Suffix information for a variable. Not currently used.

(const . t)

This exists if the variable or function return value is constant.

(throws . ( "text" ... ))

For functions or methods in languages that support typed signal
throwing, this is a list of exceptions that can be thrown.

(destructor . t)

This exists for functions which are destructor methods in a class
definition. In C++, a destructor's name excludes the ~ character. When
producing the name of the function, the ~ is added back in.

(constructor . t)

This exists for functions which are constructors in a class
definition. In C++ this is t when the name of this function is
the same as the name of the parent class.

(user-visible . t)

For functions in interpreted languages such as Emacs Lisp, this
signals that a function or variable is user visible. In Emacs Lisp,
this means a function is interactive.

(prototype . t)

For functions or variables that are not declared locally, a prototype
is something that will define that function or variable for use.
In C, the term represents prototypes generally used in header files.
In Emacs Lisp, the autoload statement creates prototypes.
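
Since EXTRA-SPEC is an ordinary alist, a specifier can be read with
assq, and a missing key simply yields nil. The sketch below uses
hypothetical names and a hand-written EXTRA-SPEC value.

```elisp
(defun my-lang-extra-spec-get (extra-spec key)
  "Return the value stored under KEY in the alist EXTRA-SPEC, or nil."
  (cdr (assq key extra-spec)))

(defun my-lang-pointer-stars (extra-spec)
  "Return a string of * characters, one per level of the pointer specifier."
  (make-string (or (my-lang-extra-spec-get extra-spec 'pointer) 0) ?*))

;; Hypothetical EXTRA-SPEC for a C declaration such as: const char **argv
(defvar my-lang-example-extra-spec
  '((pointer . 2) (const . t))
  "Example EXTRA-SPEC alist used for illustration.")
```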