A compiler is a program that translates one language to another. The target language, that is, the output of
the compiler, is often assembly language or object code, but this is not necessarily the case. Many
compilers convert one high level language to another high level language. One notable example of this is
cfront, the UNIX C++ compiler, which translates C++ into C. The C language output is then fed into cc,
the standard UNIX C compiler, which produces object code. Another example of a compiler that converts
from one high level language to another is the BEACON back end. (Magnus Rimvall's paper on
BEACON will be posted on James Alan Farrell's web site in the near future. When it is, a link will be
provided here.)

In order to write a program that performs a translation on a language, that language must first be precisely
and accurately defined. By precisely, we mean ALL aspects of the language must be set out. If any
aspect of a language is left undefined, compiler writers will be on their own to figure out what to do, and
each compiler will handle it differently; the language will not be portable. By accurately we mean
that the language must be defined consistently down to the smallest detail. Any details that are in
conflict will again lead compiler writers to resolve the problems each in their own way, and again the
language will not be portable.

In order to create a language that meets these criteria, we must develop the concept of language in a
formal mathematical sense. There are two parts to creating a language in the mathematical sense. The
first part is defining rules for the language. This is called syntax. The
second part is defining the meaning behind the rules. This is called semantics. Semantics is a difficult thing to get a handle on, and rules for
defining semantics for a language have not yet been codified.

In order to define the syntax for a language we start with an alphabet. An alphabet is a set of
tokens, not necessarily letters, that can be used to create strings in the language. A string
is a series of tokens pieced together according to the rules of the language. Strings are also
sometimes referred to as sentences. So in order to define the syntax for a language, we need an alphabet
and a set of rules used to manipulate the alphabet. The rules and the alphabet can be expressed together in
one of two forms: Chomsky Normal Form (CNF) or Extended Backus Naur Form (EBNF). Both of these forms will be covered in
detail. A description of the syntax of a language in either of these forms is called a grammar.

If we take English as an example of a language by the above definition we find something quite
interesting. Namely that what we normally understand as the alphabet is not what we define above as
being the alphabet. According to our mathematical definition of a language, the words of the English
language are the alphabet of the English language.

According to the rules of the English language, a sentence must have a subject and an action performed
by that subject. In addition a sentence can have prepositional phrases, modifiers, and other constructs,
which we do not need for this example. The following sentence is an example of an English language
sentence:

The cat sat.

In this example there are three tokens, or members of the alphabet: The, cat and
sat. There is a subject, cat, and a verb, sat. The following are not English
language sentences:

Sat ran run jump.
The paper on compilers.

We will refer to these as phrases. The first phrase is made of four tokens. It does not meet the
rules of English for two reasons: it is missing a subject, and it has four
actions, not one. The second phrase also contains four tokens, and is not an English sentence because it is
missing an action.

The alphabet is sometimes referred to as the lexical elements of a language.

Part of the task of a compiler is to determine if a program meets the rules of the language. If not, it
contains a syntax error.

Semantics refers to the meaning behind a sentence in a language. Developing mathematical rules to
describe semantics has proven to be an elusive undertaking. It is possible to define a meaning for each
token in an alphabet, but when tokens are placed together there are subtle interactions in meaning that
make precise semantic definition of a language quite difficult. This is a major reason for continued
difficulties in making modern languages completely portable.

Take the following English language example:

The bank president ate a plate of spaghetti.

This is a good solid sentence that drives AI gurus nuts. Programs that attempt to understand the
semantics of English often translate this to mean that the president of a river bank ate a plate, and then a
bunch of spaghetti fell on his lap. (I last took a course in AI about 4 years ago. This little problem may
have been overcome by now.)

Solutions for describing the semantics of a language tend to grow exponentially, since the interactions
between each modifier and the thing it modifies have to be described. Consider how you would tell a computer, in
a generic way, that bank in the above example refers to a financial institution. After you've
developed a solution consider how the computer would then handle the following:

The boat washed up onto the bank.

When developing a compiler there are generally well accepted semantics to most constructs. Assignment
statements and for loops exist in some form in most languages. Where semantics are unclear, as in a
construct unique to a particular language, English language descriptions generally have to suffice.
Hopefully the English language description can be made clear enough that all compilers based on the
description will work the same way.

Language theory owes a great deal to Noam Chomsky. In the 1950s Chomsky studied many spoken
languages and attempted to develop a formal method of describing language. Among other achievements,
it was Chomsky who first divided language study into syntax and semantics. Another achievement was
the development of a language used to specify languages.

Chomsky Normal Form (CNF) uses a set of intermediate symbols to describe syntax rules. The left side
of a rule in CNF shows a string containing intermediate symbols; the right side shows how that string
can be expanded. The expansion may contain terminal symbols, which are the tokens in the language's
alphabet, or it may contain intermediate symbols, or it may contain a combination of the two.
Traditionally terminals are shown with lower case letters and intermediate symbols are shown in upper
case. A CNF grammar always starts with the intermediate symbol S.

S -> a

In this exceedingly simple example, the only possible sentence in the language is the single token
a.

S -> aBa
B -> bb

This example again shows a grammar that can only accept a single sentence. S is replaced by aBa, and B
is replaced by bb, giving the sentence abba. In order to add flexibility to the grammar we need to be able
to make choices in how a symbol can be expanded. CNF shows a choice in two ways. The same symbol
may appear on the left side of several rules, each giving a different expansion:

S -> aBa
B -> bb
B -> aa

Or the vertical bar can be used to show a choice. The following example is exactly equivalent to the
previous example:

S -> aBa
B -> bb | aa

This grammar can produce two different sentences: abba or aaaa. Using the methods outlined so far we
can create a grammar with a large number of intermediate symbols that can produce a large number of
distinct sentences. But every sentence such a grammar creates will have a fixed, bounded length. In order
to create a useful language, either spoken or programming, we need to be able to create sentences of
arbitrary length. The only way CNF can be used to define a grammar that can create sentences of
arbitrary length is to allow a symbol to appear in its own expansion rules.

S -> aBa
B -> bb | aaB

Such a grammar is called recursive. Some sentences that can be generated with this grammar include abba,
aaabba, aaaaabba, etc. The number of a's before the final bba can grow without bound.
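One way to convince yourself of the pattern: each use of the recursive B rule contributes an aa pair, so the language is exactly an a, any number of aa pairs, then bba. A short Python sketch of a recognizer (the regular expression is my own restatement of the grammar, not part of CNF notation):

```python
import re

# S -> aBa, B -> bb | aaB generates exactly the strings a(aa)*bba:
# one 'a', then zero or more 'aa' pairs from the recursive B rule,
# then 'bb' and the closing 'a'.
SENTENCE = re.compile(r"a(aa)*bba")

def in_language(s):
    """Return True if s is a sentence of the recursive grammar above."""
    return SENTENCE.fullmatch(s) is not None
```

Running it on the examples from the text, in_language accepts abba, aaabba and aaaaabba, and rejects strings such as aabba that the grammar cannot produce.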

For a more complex example of CNF, we can look at the more or less standard definition of an expression
used by most computer languages:

S -> EXPRESSION
EXPRESSION -> TERM + EXPRESSION | TERM
TERM -> FACTOR * EXPRESSION | FACTOR
FACTOR -> ( EXPRESSION ) | NUMBER
NUMBER -> 0 NUMBER | 1 NUMBER | 2 NUMBER | ... | 9 NUMBER | 0 | 1 | 2 | ... | 9

This grammar is fairly small and compact, but it is also quite complex. There is a lot of ground to cover
here. Analyzing it will be well worth the effort, however.

Notice first of all how the expression 123 is covered by this grammar. S is expanded to EXPRESSION,
which expands to TERM, which expands to FACTOR, which expands to NUMBER. NUMBER expands to
1 NUMBER. That NUMBER then expands to 2 NUMBER. And finally this NUMBER expands to 3.

Next notice how 2 + 3 * 6 expands. S expands to EXPRESSION, which expands to TERM +
EXPRESSION. TERM expands to FACTOR, then to NUMBER, then to 2, giving us 2 +
EXPRESSION. EXPRESSION expands to TERM, which expands to FACTOR * EXPRESSION.
FACTOR expands to NUMBER, which then expands to 3. EXPRESSION expands to TERM, which
expands to FACTOR, which expands to NUMBER, which finally expands to 6.

This expression is more complex than it first appears, and requires a deeper analysis. First of all we need
to look at the precedence of the operators. Given the expression 2 + 3 * 6, according to standard
rules of arithmetic we need to perform the 3 * 6 before we add 2, giving a result of 20. Using the above
grammar to develop a compiler will do that. If we reversed the rules for TERM and EXPRESSION, we
would have the opposite case, where 2 is added to 3, then the result is multiplied by 6, giving 30.
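The precedence claim can be checked by turning the grammar directly into code. The sketch below is a minimal recursive descent evaluator; the function names mirror the intermediate symbols, but the code itself is my own, and it uses the variant rule TERM -> FACTOR '*' TERM (recursing into TERM rather than EXPRESSION) so that multiplication binds tighter than addition for every input:

```python
def evaluate(text):
    """Evaluate an expression of digits, '+' and '*' by recursive descent."""
    tokens = list(text.replace(" ", ""))
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expression():          # EXPRESSION -> TERM '+' EXPRESSION | TERM
        nonlocal pos
        value = term()
        if peek() == "+":
            pos += 1
            value += expression()
        return value

    def term():                # TERM -> FACTOR '*' TERM | FACTOR
        nonlocal pos
        value = factor()
        if peek() == "*":
            pos += 1
            value *= term()
        return value

    def factor():              # FACTOR -> NUMBER (a run of digits)
        nonlocal pos
        start = pos
        while peek() is not None and peek().isdigit():
            pos += 1
        return int("".join(tokens[start:pos]))

    return expression()
```

Because addition is only performed in expression, after its term has been fully evaluated, evaluate("2 + 3 * 6") groups the input as 2 + (3 * 6) and yields 20. Swapping the roles of the + and * rules would group it as (2 + 3) * 6 instead, which is the precedence problem described above.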

The next thing we need to consider is that all recursion is at the right end of the expansions: in a rule
such as EXPRESSION -> TERM + EXPRESSION, every symbol to the left of the recursive reference is
on a lower level than the intermediate symbol being expanded. Such a grammar is called right recursive.
Consider what would happen if you based a compiler on a grammar with the following left recursive
rules:

S -> EXPRESSION
EXPRESSION -> EXPRESSION + TERM | TERM
To expand 1 + 2, S would first expand to EXPRESSION. EXPRESSION would then expand to
EXPRESSION + TERM, whose leftmost symbol is again EXPRESSION, which expands again to
EXPRESSION + TERM, and so on. This is an infinite loop. Such a left recursive grammar cannot be read
by mechanical means. (Strictly speaking, this is not true - the entire program can be placed on a stack and
then parsed backwards, but this is quite inefficient. Real problems start when one rule is right recursive
and another in the same grammar is left recursive.)

Another interesting example is one with parentheses. It will be left to the reader to develop such an
example and see how it expands.

A short time after Chomsky devised CNF, John Backus developed a similar form for specifying language
grammars, which Peter Naur later refined; the result is known as Backus Naur Form. Backus Naur Form
can specify some languages more compactly than CNF. Over the years other researchers have added
symbols to Backus Naur Form, creating Extended Backus Naur Form (EBNF). Grammars specified in
either CNF or EBNF can be converted directly into a compiler; however, most compiler writers prefer to
work with EBNF.

EBNF uses the symbol :== to specify the right and left sides of a rule. Terminal symbols are placed in
single quotes.

S :== 'a' B 'a'
B :== 'bb'

The vertical bar is again used to represent a choice in expansion rules, and just as in CNF, recursion is
used to develop strings of arbitrary length. Two symbols added in EBNF that do not exist in CNF are the
square brackets ([]) and curly braces ({}). The square brackets denote zero or one occurrence
of an expansion, and the curly braces denote one or more occurrences.

S :== 'a' [B]
B :== {'a'}

In this example, S expands to a or aB, since there can be 0 or 1 B's. B expands to a or aa or aaa or aaaa,
etc. We can rewrite the expression grammar from the last section in EBNF:

S :== EXPRESSION
EXPRESSION :== TERM [{'+' TERM}]
TERM :== FACTOR [{'*' FACTOR}]
FACTOR :== '(' EXPRESSION ')' | NUMBER
NUMBER :== {DIGIT}
DIGIT :== '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

This grammar looks a little different. The only recursive rule is in FACTOR. This is because the square
brackets and curly
braces allow strings of arbitrary length to be created without recursion. When translating this to a
compiler, recursive rules translate to recursive procedure calls, rules with curly braces translate to loops
and rules with square brackets translate to if statements.
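That translation can be sketched directly in code. In this illustration (the function names and the whitespace-separated tokenization are my own choices), each intermediate symbol becomes a small function and each repetition becomes a while loop, so no recursion is needed for addition or multiplication:

```python
def parse_expression(tokens, pos=0):
    # EXPRESSION :== TERM followed by repeated '+' TERM -> a loop
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        value += rhs
    return value, pos

def parse_term(tokens, pos):
    # TERM :== FACTOR followed by repeated '*' FACTOR -> a loop
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        rhs, pos = parse_factor(tokens, pos + 1)
        value *= rhs
    return value, pos

def parse_factor(tokens, pos):
    # FACTOR :== NUMBER (parentheses omitted to keep the sketch short)
    return int(tokens[pos]), pos + 1
```

For example, parse_expression("1 + 2 + 3 * 4 + 5".split()) consumes all nine tokens and produces 20. Because each loop accumulates left to right, this version also makes + and * left associative, which matches ordinary arithmetic.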

To expand the expression 1 + 2 + 3 * 4 + 5, start with S expanding to EXPRESSION, and EXPRESSION
to TERM [{'+' TERM}]. The first TERM expands to FACTOR, then to NUMBER, then to 1. The
bracketed part covers an arbitrary number of further terms, since the curly braces denote repetition. The
compiler will expand it to + 2 + TERM + TERM, then expand the first of these TERMs to FACTOR '*'
FACTOR. Expanding those FACTORs gives 3 * 4 + TERM, and the final TERM expands in the end to 5.

Sometimes it is useful to denote zero to n instances of a rule. This is done by using curly braces with an
asterisk after them ({}*).

The same concerns we had with CNF carry over to EBNF. A left recursive rule is a left recursive rule no
matter which notation is used to express it, and grammars with left recursive rules cannot be parsed by
the straightforward mechanical means described above. We must also worry about operator precedence
no matter which notation we use.

Even though EBNF is rather more confusing than CNF, we will tend to use it over CNF. First of all
EBNF supports loops, so we can avoid some recursion, allowing for more efficient compilers. Also
EBNF is more standard for compiler writers than CNF. When we look at translating EBNF into a parser
it will become easier to understand how EBNF works.

The process of translating grammars in either form into a parser is actually quite mechanical. This has led
to a number of tools for automatically creating compilers. The best known tool is the UNIX program
YACC (Yet Another Compiler Compiler). (I wanted to provide a link here to a page with a good YACC
explanation. The best I've found is the
SunOS YACC man page,
but I did find an excellent page on
BISON, the GNU version of YACC.) YACC is fed a grammar, written in a notation close to BNF, and
outputs C source code for a parser that matches the grammar. (The BEACON back end was written in
Ada, so we did not use YACC.) Another tool for automatically developing parsers has been built into the
PROLOG programming language. By entering a CNF grammar into the PROLOG interpreter, you can
effectively turn it into a parser for any right recursive language.

Computer languages are languages in the mathematical sense. They are made from tokens and they can
be described with right recursive grammars using either EBNF or CNF.

Pascal, a typical computer language, contains the following token types, which together form its
alphabet: reserved words, reserved symbols, identifiers, strings, integers and reals. The entire EBNF of
Pascal will not be given, but we do have room for several examples. Examples of reserved words include
Program, begin, if, etc. Examples of reserved symbols include (, ), ', .., etc. An
identifier starts with a letter, continues with letters, digits and underscores, and can be any length (but
only the first 30 characters are significant). Strings are any characters enclosed in single quotes. Integers
are a series of digits with no decimal point, and reals are a series of digits that do contain a decimal point
and optionally a power of ten scaling factor (scientific notation).

In addition to the alphabet, when reading a computer program we have to worry about delimiters. A
delimiter separates two tokens. In most cases a blank space is a delimiter. Reserved symbols also act as
delimiters.

Any time a token is encountered that is not part of the language's alphabet, we have encountered a lexical
error. This is different from a syntax error. An example of a lexical error might be the symbol -> in a
Pascal program (C uses this symbol for pointers, but Pascal does not use it).
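The classification work of the lexical analyzer, including catching lexical errors, can be sketched as follows. The reserved word and symbol lists here are abbreviated stand-ins for illustration, not the full Pascal sets:

```python
import re

RESERVED_WORDS = {"program", "begin", "end", "var", "if"}   # abbreviated list
RESERVED_SYMBOLS = {"(", ")", ";", ".", ",", ":=", "+", "*"}

# Order matters: try the real pattern before the integer pattern so that
# "3.14" is classified as one real, not an integer followed by debris.
TOKEN_PATTERNS = [
    ("real",       r"\d+\.\d+([eE][+-]?\d+)?"),
    ("integer",    r"\d+"),
    ("identifier", r"[A-Za-z][A-Za-z0-9_]*"),
    ("string",     r"'[^']*'"),
]

def classify(token):
    """Return the token's type, or raise ValueError for a lexical error."""
    if token.lower() in RESERVED_WORDS:
        return "reserved word"
    if token in RESERVED_SYMBOLS:
        return "reserved symbol"
    for kind, pattern in TOKEN_PATTERNS:
        if re.fullmatch(pattern, token):
            return kind
    raise ValueError(f"lexical error: {token!r}")
```

With these tables, classify("begin") is a reserved word and classify("3.14") is a real, while classify("->") raises the lexical error described above, since -> belongs to no category of this alphabet.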

Pascal is quite standard in its alphabet. Most other computer languages, such as C, Ada, FORTRAN,
BASIC, COBOL, LISP and PROLOG, have basically the same alphabets (except that PROLOG has no
reserved words). Some computer languages also have context sensitive reserved words. Such words are
reserved in certain parts of the program and not in others. Imagine that we could declare a procedure
named Integer in Pascal. That would mean that in the Var section Integer is a reserved
word, but in sections where procedures can be declared it is not considered reserved: it is reserved
depending on the context where it is found. Pascal compilers handle all reserved words as globally
reserved words. We will not make this assumption; we will develop a compiler that can handle context
sensitive reserved words.
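A compiler that supports context sensitive reserved words might consult a per-section table rather than one global list. This is a purely hypothetical sketch; the section names and table contents are assumptions for illustration:

```python
# Hypothetical tables: which words act as reserved depends on which
# section of the program is currently being parsed.
CONTEXT_RESERVED = {
    "var": {"integer", "real", "array"},   # type names reserved in Var sections
    "procedure": set(),                    # nothing extra reserved where procedures are declared
}

def is_reserved(word, context):
    """Return True if word acts as a reserved word in the given context."""
    return word.lower() in CONTEXT_RESERVED.get(context, set())
```

Under these tables, is_reserved("Integer", "var") is True while is_reserved("Integer", "procedure") is False, mirroring the Integer example above.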

One final aspect of computer languages is that they must be computable. The definition of
computable in this case is that the language can be interpreted by a Turing machine. In order to interpret
the language the Turing machine might first translate it to another language, but the Turing machine
must then be able to interpret the result.

One aspect of spoken languages is that no determination has been made as to whether they are
computable or not. To my knowledge this problem has only been solved for one language (I believe that
language is Dutch, but I could be mistaken), and that language has been shown NOT to be computable.
For other languages it is believed that determining whether they are computable or not is an undecidable
problem. Some would argue that the human brain is nothing more than an advanced computer, and
therefore any spoken language must be computable. This does not follow, because spoken languages
contain fuzzy constructs, very open to interpretation and to misinterpretation. When it comes to spoken
language, mistakes are often made. Computer languages must be much more strictly defined, in order to
keep them computable.