To those who enjoy it, it's an unforgettable pleasure worth any number of
return trips.

If you've read this column more than once, you're probably a pretty
inquisitive person. At one point or another, you've probably asked yourself
what it takes to (gulp!) create your own programming language.

Compiler writing is often thought to be one of the last black arts in the
programming trade. Even though anyone who gets a degree in computer science
suffers through a semester or two of compiler writing, most programmers
endeavor to forget as quickly as possible what they supposed to have learned
in those classes.

Creating your own language is like drinking port. It's an acquired taste, but
to those who enjoy it, it's an unforgettable pleasure worth any number of
return trips. I'm one of the nut cases who enjoy writing compilers and
interpreters, and I've read literally every book in print on the subject, not
to mention many more now out of print.

Having done this for a few years now, I can safely assert that almost all the
books are way off the mark. Most of them require that you write the whole
compiler by hand, omitting such necessary tools as YACC and Lex, and few of
them give you the pleasure of presenting the source code to a complete
compiler. Usually, the really good stuff is left, in a phrase I've come to
dread, as "an exercise for the reader."

Not that printing the source code to a compiler is easy to do. For one thing,
it can take up a lot of valuable editorial pages that the publisher would
rather see devoted to more "substantial" topics such as theory and exercises
left to the reader. For another, there's always a big problem in deciding what
the target machine is. Make it a PC, and you leave out a lot of college
students who woek on Suns and VAXs.

The truth is that when I had to get the job done on my first compiler, I
learned more from MKS Lex and YACC (Mortice Kern Systems, 35 King Street
North, Waterloo, Ontario, Canada N2J 2W9; (519) 884-2251). A compiler is made
up roughly of these parts. The scanner takes the individual characters that
make up the program's keyboards and other "lexical" elements (its words), such
as numbers, math operators, and variable names. It converts each item into an
integer identifier (such as the number 257 for PRINT [underscore] and 256 for
STRING [underscore] in the case of the BASIC line PRINT "Hello, world."). It
also associates a value with that identifier, such as the string "Hello,
world" in the case of STRING [underscore]. (Often, as with PRINT, no such
value is needed.) The scanner normally has little knowledge of the language
and will cheerfully accept PRINT CLS WHILE + SELECT SELECT as valid input. The
parser takes these lexical elements (viewing them as their integer
identifiers) and sees that they're in the right order. This is the heart of
the compiler and, normally, the most difficult to write. The parser then adds
identifiers such as variables and subroutines to the symbol table, and it
generates code or returns an errot if the parser deems the input program
incorrect.

What do the oddly named YACC and Lex have to do with this? It turns out that
while the parser is incredibly tricky to get right--without automated help you
will almost always create a langauge with invalid contracts--it is also quite
amendable to being automated. This is where most textbooks get it completely
wrong. They'll spend two or three chapters on archaic parsing theory, then
provide little or no real code to illustrate their outmoded idead. What YACC
lets you do is create a description of the language using an extended version
of the Backus-Naur Form you often see in language syntax charts, and it
creates a C function called yyparse() that does all the hard work for you.
Oh--you don't like C? Then use the -LP option to create a Turbo Pascal
routine! Writing scanners is normally less tricky, but that also can be
automated using Lex. Lex generates a C (or Turbo Pascal) routine called
yylex() that's normally called by yyparse().

The MKS Lex and YACC manual is simply the best applied introduction to
compiler writing I've ever seen. Starting with a simple command line
calculator, it proceeds to a C-like language with functions, procedures, local
and global variables, and a rich set of control structures. It has both an
interpreter and a compiler. The only thing I don't like is that the assembler
code the compiler generates isn't for a PC, but for an imaginery Motorola-like
processor. Using YACC and Lex isn't easy, but this tutorial is great. If
you're willing to spend a few months studying, you can come out with the
beginnings of your own homemade language, in Turbo Pascal or ANSI C!

I've posted a grammer for an interpreter called Tiny Basic for Windows in the
file TBA-SIC?? .ZIP, where ?? stands for a version number such as 10 for
version 1.0 or 21 for version 2.1. It has the C and Turbo Pascal source code
for everything you need to get your own language going.