BASIC interpreter

The application described in this section is a program interpreter for
Basic. Thus, it is a program that can run other programs written in
Basic. Of course, we will only deal with a restricted language,
which contains the following commands:

PRINT expression

Prints the result of the evaluation of the expression.

INPUT variable

Prints a prompt (?), reads an integer typed in by the
user, and assigns its value to the variable.

LET variable = expression

Assigns the result of the evaluation of expression to the variable.

GOTO line number

Continues execution at the given line.

IF condition THEN line number

Continues execution at the given line if the condition is true.

REM any string

One-line comment.

Every line of a Basic program is labelled with a line number, and
contains only one command. For instance, a program that computes and
then prints the factorial of an integer given by the user is written:

We also wish to write a small text editor, working as a toplevel interactive
loop. It should be able to add new lines, display a program, execute
it, and display the result.
Execution of the program is started with the
RUN command. Here is an example of the evaluation of this
program:

> RUN
factorial of: ? 5
120

The interpreter is implemented in several distinct parts:

Description of the abstract syntax: defines the data types that
represent Basic programs and their components (lines, commands,
expressions, etc.).

Program pretty printing: transforms the internal representation of
Basic programs into strings, in order to display them.

Lexing and parsing: perform the inverse transformation, that is, they
transform a string into the internal representation of a Basic program
(the abstract syntax).

Evaluation: the heart of the interpreter. It controls and runs the
program. As we will see, functional languages, such as Objective CAML,
are particularly well adapted for this kind of problem.

Toplevel interactive loop: glues together all the previous parts.

Abstract syntax

Figure 6.2 introduces the concrete syntax, as a BNF
grammar, of the Basic we will implement. This kind of syntax
description is covered in chapter 11.

    Unary_Op   ::= - | !

    Binary_Op  ::= + | - | * | / | %
                 | = | < | > | <= | >= | <>
                 | & | '|'

    Expression ::= integer
                 | variable
                 | "string"
                 | Unary_Op Expression
                 | Expression Binary_Op Expression
                 | ( Expression )

    Command    ::= REM string
                 | GOTO integer
                 | LET variable = Expression
                 | PRINT Expression
                 | INPUT variable
                 | IF Expression THEN integer

    Line       ::= integer Command

    Program    ::= Line
                 | Line Program

    Phrase     ::= Line | RUN | LIST | END

Figure 6.2: BASIC Grammar.

We can see that the way expressions are defined does not ensure that a
well-formed expression can be evaluated. For instance, 1+"hello" is an
expression, and yet it is not possible to evaluate
it. This deliberate choice lets us simplify both the abstract syntax
and the parsing of the Basic language. The price to pay for
this choice is that a syntactically correct Basic program may generate
a runtime error because of a type mismatch.

Defining Objective CAML data types for this abstract syntax is easy:
we simply translate each rule of the concrete syntax into a sum type:
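A translation of this kind could look like the sketch below. The type names line and program are used later in this section; the constructor names (UMINUS, ExpInt, Rem, and so on) and the field names num and cmd are assumptions chosen for illustration. Note that parentheses need no constructor: the tree structure already encodes grouping.

```ocaml
(* Operators. Constructor names are illustrative assumptions. *)
type unr_op = UMINUS | NOT

type bin_op =
  | PLUS | MINUS | MULT | DIV | MOD                  (* arithmetic *)
  | EQUAL | LESS | LESSEQ | GREAT | GREATEQ | DIFF   (* comparison *)
  | AND | OR                                         (* logical    *)

(* One constructor per alternative of the Expression rule. *)
type expression =
  | ExpInt of int
  | ExpVar of string
  | ExpStr of string
  | ExpUnr of unr_op * expression
  | ExpBin of expression * bin_op * expression

(* One constructor per alternative of the Command rule. *)
type command =
  | Rem of string
  | Goto of int
  | Print of expression
  | Input of string
  | If of expression * int
  | Let of string * expression

(* A line pairs a line number with a single command. *)
type line = { num : int; cmd : command }

type program = line list

(* 1+"hello" is a valid tree even though it cannot be evaluated. *)
let ill_typed = ExpBin (ExpInt 1, PLUS, ExpStr "hello")
```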

We also define the abstract syntax for the commands for the small
program editor:

# type phrase = Line of line | List | Run | PEnd;;

It is convenient to allow the programmer to skip some parentheses in
arithmetic expressions. For instance, the expression 1+3*4 is
usually interpreted as 1+(3*4). To this end, we associate an integer
with each operator of the language:
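One possible assignment is sketched below. The function names priority_uop and priority_binop and the particular integers are assumptions; only the relative order matters, with a higher number meaning that the operator binds more tightly.

```ocaml
type unr_op = UMINUS | NOT

type bin_op =
  | PLUS | MINUS | MULT | DIV | MOD
  | EQUAL | LESS | LESSEQ | GREAT | GREATEQ | DIFF
  | AND | OR

(* Unary minus binds tighter than any binary operator; logical
   negation binds looser than all of them. *)
let priority_uop = function
  | NOT -> 1
  | UMINUS -> 7

(* Multiplicative operators bind tighter than additive ones, which bind
   tighter than comparisons, which bind tighter than logical operators. *)
let priority_binop = function
  | MULT | DIV -> 6
  | PLUS | MINUS -> 5
  | MOD -> 4
  | EQUAL | LESS | LESSEQ | GREAT | GREATEQ | DIFF -> 3
  | AND | OR -> 2
```

With these values, 1+3*4 groups as 1+(3*4) because the priority of * exceeds that of +.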

Expression printing needs to take into account operator priority to
print as few parentheses as possible. For instance, parentheses are
put around a subexpression at the right of an operator only if the
subexpression's main operator has a lower priority than the main
operator of the whole expression. Also, arithmetic operators are
left-associative, thus the expression 1-2-3 is interpreted as
(1-2)-3.

To deal with this, we use two auxiliary functions ppl and
ppr to print left and right subtrees, respectively. These
functions take two arguments: the tree to print and the priority of
the enclosing operator, which is used to decide if parentheses are
necessary. Left and right subtrees are distinguished to deal with
associativity. If the current operator priority is the same as the
enclosing operator priority, left trees do not need parentheses whereas
right ones may require them, as in 1-(2-3) or 1-(2+3).

The initial tree is taken as a left subtree with minimal priority
(0).
The expression pretty printing function pp_expression is:
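The sketch below shows the idea on a reduced expression type (integers, variables and the four arithmetic operators). The names pp_expression, ppl and ppr come from the text; the helper names, the priorities and the argument order are assumptions.

```ocaml
type bin_op = PLUS | MINUS | MULT | DIV

type expression =
  | ExpInt of int
  | ExpVar of string
  | ExpBin of expression * bin_op * expression

let priority_binop = function MULT | DIV -> 6 | PLUS | MINUS -> 5
let pp_binop = function PLUS -> "+" | MINUS -> "-" | MULT -> "*" | DIV -> "/"
let parenthesis s = "(" ^ s ^ ")"

let pp_expression =
  (* Left subtree: parentheses only if its operator binds strictly
     less tightly than the enclosing one. *)
  let rec ppl pr = function
    | ExpInt n -> string_of_int n
    | ExpVar v -> v
    | ExpBin (e1, op, e2) ->
        let pr2 = priority_binop op in
        let res = ppl pr2 e1 ^ pp_binop op ^ ppr pr2 e2 in
        if pr2 >= pr then res else parenthesis res
  (* Right subtree: equal priority also needs parentheses, because
     the operators are left-associative. *)
  and ppr pr = function
    | ExpBin (e1, op, e2) ->
        let pr2 = priority_binop op in
        let res = ppl pr2 e1 ^ pp_binop op ^ ppr pr2 e2 in
        if pr2 > pr then res else parenthesis res
    | e -> ppl pr e
  in
  (* The whole tree is printed as a left subtree of priority 0. *)
  ppl 0
```

The only difference between ppl and ppr is the comparison (>= versus >), which is exactly what distinguishes (1-2)-3, printed without parentheses, from 1-(2-3).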

Lexing

Lexing and parsing do the inverse transformation of
printing, going from a string to a syntax tree. Lexing
splits the text of a command line into independent lexical units
called lexemes, with Objective CAML type:

A particular lexeme denotes the end of an expression: Lend.
It is not present in the text of the expression, but is created by the
lexing function (see the lexer function, described below).

The string being lexed is kept in a record that contains a mutable
field indicating the position after which lexing has not been
done yet. Since the size of the string is used several times and does
not change, it is also stored in the record:

# type string_lexer = {string : string; mutable current : int; size : int};;

This representation lets us define the lexing of a string as the
application of a function to a value of type string_lexer
returning a value of type lexeme. Modifying the current
position in the string is done as a side effect.

The lexer function is very simple: it matches the current
character of a string and, based on its value, extracts the
corresponding lexeme and modifies the current position to the start of
the next lexeme. The code is simple because, for all characters except
two, the current character alone determines which lexeme to extract. In
the more complicated case of '<', we need to look at the next
character, which might be a '=' or a '>', producing two different
lexemes. The same problem arises with '>', which may be followed by '='.
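A lexer along these lines might be written as follows. The names lexer, string_lexer and Lend come from the text; the other lexeme constructors, the helpers init_lex, forward and extract, and the exception LexerError are assumptions made for this sketch.

```ocaml
type lexeme =
  | Lint of int
  | Lident of string
  | Lsymbol of string
  | Lstring of string
  | Lend

type string_lexer = { string : string; mutable current : int; size : int }

exception LexerError

let init_lex s = { string = s; current = 0; size = String.length s }
let forward cl = cl.current <- cl.current + 1

(* Consume the longest run of characters satisfying pred and return it. *)
let extract pred cl =
  let start = cl.current in
  while cl.current < cl.size && pred cl.string.[cl.current] do
    forward cl
  done;
  String.sub cl.string start (cl.current - start)

let is_digit = function '0' .. '9' -> true | _ -> false

let is_alnum = function
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' -> true
  | _ -> false

let rec lexer cl =
  if cl.current >= cl.size then Lend
  else
    match cl.string.[cl.current] with
    | ' ' | '\t' -> forward cl; lexer cl
    | 'a' .. 'z' | 'A' .. 'Z' -> Lident (extract is_alnum cl)
    | '0' .. '9' -> Lint (int_of_string (extract is_digit cl))
    | '"' ->
        forward cl;                                 (* skip opening quote *)
        let s = extract (fun c -> c <> '"') cl in
        forward cl;                                 (* skip closing quote *)
        Lstring s
    | ('+' | '-' | '*' | '/' | '%' | '&' | '|' | '!' | '=' | '(' | ')') as c ->
        forward cl;
        Lsymbol (String.make 1 c)
    | ('<' | '>') as c ->
        (* One character of lookahead decides between <, <=, <>, > and >=. *)
        forward cl;
        let next =
          if cl.current < cl.size then Some cl.string.[cl.current] else None
        in
        (match c, next with
         | '<', Some '=' -> forward cl; Lsymbol "<="
         | '>', Some '=' -> forward cl; Lsymbol ">="
         | '<', Some '>' -> forward cl; Lsymbol "<>"
         | _ -> Lsymbol (String.make 1 c))
    | _ -> raise LexerError
```

Each call returns one lexeme and advances current as a side effect, so calling lexer repeatedly on the same record walks through the string until Lend is returned.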

Parsing

The only difficulty in parsing our language comes from expressions.
Indeed, knowing the beginning of an expression is not enough to know
its structure. For instance, having parsed the beginning of an
expression as being 1+2+3, the resulting syntax tree for this part
depends on the rest of the expression: its structure is different when
it is followed by +4 or *4 (see figure 6.3).

Figure 6.3: Basic: abstract syntax tree examples.

However, since the tree structure for 1+2 is the same in both cases,
it can be built. As the position of +3 in the structure is not fully
known, it is temporarily stored.

To build the abstract syntax tree, we use a pushdown automaton
similar to the one built by yacc. Lexemes are read one by one and put
on a stack until there is enough information to build the
expression. They are then removed from the stack and replaced by the
expression. This latter operation is called reduction.

The reduce function implements stack reduction. There are two
cases to consider, depending on whether the stack starts with:

an expression followed by a unary operator, or

an expression followed by a binary operator and an expression.

Moreover, reduce takes an argument indicating the minimal
priority that an operator should have to trigger reduction. To avoid
this reduction condition, it suffices to give the minimal value, zero,
as the priority.
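The two cases can be sketched as follows, on a reduced expression type. The name reduce comes from the text; the stack element type exp_elem, its constructors, the exception ParseError and the priority values are assumptions. The head of the list is the top of the stack, that is, the most recently read element.

```ocaml
type unr_op = UMINUS | NOT
type bin_op = PLUS | MINUS | MULT | DIV

type expression =
  | ExpInt of int
  | ExpUnr of unr_op * expression
  | ExpBin of expression * bin_op * expression

(* What the parser stack may hold. *)
type exp_elem =
  | Texp of expression
  | Tbin of bin_op
  | Tunr of unr_op

exception ParseError

let priority_uop = function NOT -> 1 | UMINUS -> 7
let priority_binop = function MULT | DIV -> 6 | PLUS | MINUS -> 5

(* Reduce the top of the stack if its operator has priority >= pr. *)
let reduce pr = function
  | Texp e :: Tunr op :: st when priority_uop op >= pr ->
      Texp (ExpUnr (op, e)) :: st
  | Texp e1 :: Tbin op :: Texp e2 :: st when priority_binop op >= pr ->
      (* e2 was pushed first, so it is the left operand. *)
      Texp (ExpBin (e2, op, e1)) :: st
  | _ -> raise ParseError
```

For instance, after reading 1+2 the stack holds 2, +, 1 from top to bottom, and reduce 0 replaces these three elements with the tree for 1+2. Calling reduce 6 on the same stack fails, because + has a lower priority than 6.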

Once all lexemes are defined and stacked, the function
reduce_all builds the abstract syntax tree with the elements
remaining in the stack. If the expression being parsed is well formed,
only one element should remain in the stack, containing the tree for
this expression.

The parse_exp function is the main expression parsing
function. It reads a string, extracts its lexemes and passes them to the
stack_or_reduce function. Parsing stops when the current
lexeme satisfies a predicate that is given as an argument.

Evaluation

A Basic program is a list of lines. Execution starts at the first
line. Interpreting a program line consists of executing the task
corresponding to its command. There are three different kinds of
commands: input-output (PRINT and INPUT), variable
declaration or modification (LET), and flow control
(GOTO and IF...THEN). Input-output commands interact
with the user and use the corresponding Objective CAML functions.

Variable declaration and modification commands need to know how to
compute the value of an arithmetic expression and the memory
location to store the result. Expression evaluation returns an
integer, a boolean, or a string; these results all have type value.

# type value = Vint of int | Vstr of string | Vbool of bool;;

Variable declaration should allocate some memory to store the
associated value. Similarly, variable modification requires the
modification of the associated value. Thus, evaluation of a Basic
program uses an environment that stores the association
between a variable name and its value. It is represented by an
association list of tuples (name,value):

# type environment = (string * value) list;;

The variable name is used to access its value. Variable modification
modifies the association.
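Both operations are short on an association list. In the sketch below, the function names lookup and update are assumptions, and "modification" is rendered functionally: update returns a new environment rather than mutating the old one.

```ocaml
type value = Vint of int | Vstr of string | Vbool of bool
type environment = (string * value) list

(* Value bound to name; raises Not_found for an unknown variable. *)
let lookup (env : environment) name = List.assoc name env

(* Environment in which name is bound to v: the existing binding is
   replaced if there is one, otherwise a new binding is appended. *)
let rec update (env : environment) name v : environment =
  match env with
  | [] -> [ (name, v) ]
  | (n, _) :: rest when n = name -> (name, v) :: rest
  | binding :: rest -> binding :: update rest name v
```

Declaring a variable and modifying it are then the same operation, which suits a language where LET serves both purposes.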

Flow control commands, conditional or unconditional, specify the
number of the next line to execute. By default, it is the next
line. To do this, it is necessary to remember
the number of the current line.

The list of commands representing the program being edited under
the toplevel is not an efficient data structure for running the
program. Indeed, it is then necessary to look at the whole list of
lines to find the line indicated by a flow control command
(IF and GOTO). Replacing the list of lines with an array
of commands allows direct access to the command following a
flow control command, using the array index instead of the line
number in the flow control command. This solution requires some
preprocessing called assembly before executing a RUN
command. For reasons that will be detailed shortly, a program
after assembly is not represented as an array of commands but as an
array of lines:

# type code = line array;;

As in the calculator example of previous chapters, the interpreter
uses a state that is modified for each command evaluation. At each
step, we need to remember the whole program, the next line to
interpret and the values of the variables. The program being
interpreted is not exactly the one that was entered in the toplevel:
instead of a list of lines, it is an array of lines. Thus
the state of a program during execution is:

# type state_exec = {line : int; xprog : code; xenv : environment};;

Two different reasons may lead to an error during the evaluation of a
line: an error while computing an expression, or branching to an
absent line. They must be dealt with so that the interpreter exits
nicely, printing an error message. We define an exception as well as a
function to raise it, indicating the line where the error occurred.
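A minimal sketch of such a pair is given below. The names RunError and runerr are assumptions, as is the helper add_values, which only illustrates how a type mismatch such as 1+"hello" turns into a runtime error tagged with the offending line number.

```ocaml
type value = Vint of int | Vstr of string | Vbool of bool

(* The exception carries the number of the line where the error occurred. *)
exception RunError of int

let runerr n = raise (RunError n)

(* Addition at line n: anything other than two integers is a type mismatch. *)
let add_values n v1 v2 =
  match v1, v2 with
  | Vint a, Vint b -> Vint (a + b)
  | _ -> runerr n
```

The toplevel can then catch RunError n in a single place and report the line number n to the user.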

Assembly

Assembling a program that is a list of numbered lines (type
program) consists of transforming this list into an array and
modifying the flow control commands. This last modification only
needs an association table between line numbers and array indexes.
This is easily provided by storing lines (with their line numbers),
instead of commands, in the array: to find the association between
a line number and the index in the array, we look the line number up
in the array and return the corresponding index. If no line is found
with this number, the index returned is -1.
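Assembly as described here can be sketched in a few lines. The function names assemble and lookup_index are assumptions, and the command type is reduced to what the example needs. The lookup is a plain linear scan, which is acceptable because it runs once per flow control command, before execution starts.

```ocaml
type expression = ExpInt of int | ExpVar of string

type command =
  | Rem of string
  | Goto of int
  | Print of expression
  | If of expression * int
  | Let of string * expression

type line = { num : int; cmd : command }
type code = line array

(* Array index of the line numbered n, or -1 if there is no such line. *)
let lookup_index (tprog : code) n =
  let rec search i =
    if i >= Array.length tprog then -1
    else if tprog.(i).num = n then i
    else search (i + 1)
  in
  search 0

(* Turn the list of lines into an array and replace every branch
   target (a line number) with the corresponding array index. *)
let assemble (prog : line list) : code =
  let tprog = Array.of_list prog in
  Array.iteri
    (fun i l ->
      match l.cmd with
      | Goto n -> tprog.(i) <- { l with cmd = Goto (lookup_index tprog n) }
      | If (c, n) -> tprog.(i) <- { l with cmd = If (c, lookup_index tprog n) }
      | _ -> ())
    tprog;
  tprog
```

Keeping the original line number in the num field is what makes the lookup possible, and it also lets error messages refer to the numbers the user typed.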

The execution of a command corresponds to a transition from one
state to another. More precisely, the environment is modified if the
command is an assignment. Furthermore, the next line to execute is always
modified. As a convention, if the next line to execute does not exist,
we set its value to -1.

On each call of the transition function eval_cmd, we
look up the current line, run it, then set the number of the next
line to run as the current line. If the last line of the program is
reached, the current line is given the value -1. This
will tell us when to stop.

Program evaluation

We recursively apply the transition function until we reach a state
where the current line number is -1.
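The transition function and the loop that drives it can be sketched as below, on a reduced language: integers and booleans only, three binary operators, and no INPUT or REM. The names eval_cmd and state_exec come from the text; eval_exp, next_line, jump, run and RunError are assumptions, and branch targets are taken to be array indexes, as they would be after assembly.

```ocaml
type bin_op = MINUS | MULT | LESSEQ
type expression =
  | ExpInt of int
  | ExpVar of string
  | ExpBin of expression * bin_op * expression
type command =
  | Goto of int
  | Print of expression
  | If of expression * int
  | Let of string * expression
type line = { num : int; cmd : command }
type code = line array
type value = Vint of int | Vbool of bool
type environment = (string * value) list
type state_exec = { line : int; xprog : code; xenv : environment }

exception RunError of int

(* Evaluate an expression; n is the line number used in error reports. *)
let rec eval_exp n env = function
  | ExpInt i -> Vint i
  | ExpVar v -> (try List.assoc v env with Not_found -> raise (RunError n))
  | ExpBin (e1, op, e2) -> (
      match (eval_exp n env e1, op, eval_exp n env e2) with
      | Vint a, MINUS, Vint b -> Vint (a - b)
      | Vint a, MULT, Vint b -> Vint (a * b)
      | Vint a, LESSEQ, Vint b -> Vbool (a <= b)
      | _ -> raise (RunError n))

(* Index of the following line, or -1 after the last one. *)
let next_line st =
  let n = st.line + 1 in
  if n < Array.length st.xprog then n else -1

(* Branch to index i; a negative index means the target line is absent. *)
let jump num st i = if i < 0 then raise (RunError num) else { st with line = i }

(* One transition: run the current line and return the next state. *)
let eval_cmd st =
  let { num; cmd } = st.xprog.(st.line) in
  match cmd with
  | Print e ->
      (match eval_exp num st.xenv e with
       | Vint i -> print_int i
       | Vbool b -> print_string (string_of_bool b));
      print_newline ();
      { st with line = next_line st }
  | Let (v, e) ->
      let x = eval_exp num st.xenv e in
      { st with line = next_line st; xenv = (v, x) :: List.remove_assoc v st.xenv }
  | Goto i -> jump num st i
  | If (c, i) -> (
      match eval_exp num st.xenv c with
      | Vbool true -> jump num st i
      | Vbool false -> { st with line = next_line st }
      | Vint _ -> raise (RunError num))

(* Apply transitions until the current line is -1; return the final state. *)
let rec run st = if st.line = -1 then st else run (eval_cmd st)
```

Because run is tail-recursive, a Basic program that loops for a long time does not grow the OCaml call stack.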

The one_command function processes the insertion of a line
or the execution of a command. It modifies the state of the toplevel
loop, which consists of a program and an environment. This
state, represented by the loop_state type, is different from
the evaluation state.
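The insertion half of one_command has to keep the program sorted by line number and let a newly typed line replace an existing line with the same number. A sketch of that step, with an assumed function name insert and a command type reduced to two constructors:

```ocaml
type command = Rem of string | Goto of int
type line = { num : int; cmd : command }
type program = line list

(* Insert l into a program sorted by line number, replacing any
   existing line that carries the same number. *)
let rec insert (l : line) (p : program) : program =
  match p with
  | [] -> [ l ]
  | x :: rest ->
      if x.num < l.num then x :: insert l rest
      else if x.num = l.num then l :: rest
      else l :: p
```

Since the list stays sorted, LIST can print it as is, and assembly can convert it to an array without any further sorting.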

Further work

The Basic we implemented is minimalist. If you want to go further,
the following exercises hint at some possible extensions.

Floating-point numbers: as is, our language only deals with
integers, strings and booleans. Add floats, as well as the
corresponding arithmetic operations in the language grammar. We need
to modify not only parsing, but also evaluation, taking into
account the implicit conversions between integers and floats.

Arrays: Add to the syntax the command DIM
var[x] that declares an array var of size x, and the
expression var[i] that references the ith element of the
array var.

Toplevel directives: Add the toplevel directives SAVE "file_name" and LOAD "file_name" that save a Basic
program to the hard disk, and load a Basic program from the hard
disk, respectively.

Sub-program: Add sub-programs. The GOSUB line
number command calls a sub-program by branching to the given
line number while storing the line from where the call is made. The
RETURN command resumes execution at the line following the
last GOSUB call executed, if there is one, or exits the
program otherwise. Adding sub-programs requires evaluation to
manage not only the environment but also a stack containing the
return addresses of the current GOSUB calls. The GOSUB
command adds the possibility of defining recursive sub-programs.