Growing a Compiler, Chapter 4

format compact % Help MATLAB save screen space

4 Using the GEM capabilities

Up to now the emphasis has been on choosing new capabilities, expressing them first as bootstraps, then building them
into GEM itself. The final built-in capability of GEM added here is multi-character input and output. To sum up the progress,
the implementations have included

gem0 -- minimal, risky

gem1 -- robust implementation, primitive trace

gem2 -- built-in execution of grammars containing whitespace

gem3 -- built-in character classes asciiIOG....digitCFG, better trace

gem4 -- built-in multicharacter input and output symbols

If you are interested in the new files iog4.h, iog4.c and gem4.m, you can download this File Exchange submission and look
at them in the MATLAB editor.

Now it is time to use those built-in capabilities.

G=gem4();
GEM=G.run;

4.1 Kleene * and Kleene + in IOGs

It is proposed that a rule such as

g = p*;

will mean that a g is a sequence of zero or more p, and

g = p+;

will mean that a g is a sequence of one or more p.

4.1.1 Implementing Kleene *

GEM knows nothing about the Kleene star, so progress is made by transforming regular expression grammars back to the original
IOG form. The tricks are to replace each

r*

with a new symbol (say R) and add new rules

R = rR; R =;

The tricks are separately applied: the resulting two IOGs are concatenated to make a grammar acceptable to GEM.

The limitation is that upper-case phrase names will appear, as if by magic, whenever the grammar applies * to the corresponding
lower-case phrase name and, as usual, the names d, l and a must be avoided if they are going to be used with their built-in meaning.

The new rules are created by IOG nostar1, which discards the grammar text and emits only the new rules. nostar1 contains 26 rules of the form

s = 'a*' "A=aA;A=;";
s = 'b*' "B=bB;B=;";
...
s = 'z*' "Z=zZ;Z=;";

As it turns out, duplicate rules are generated if a particular a* occurs more than once. This is not a logical
problem: backtracking cures all. But it is a catastrophic performance problem. The gem4 function nodup cleans out the extras. It is not necessary for postfix below, but it is necessary for pretty. To be clear, the bootstrapping game is still being played: nodup is implemented using GEM.
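
The effect of removing duplicate rules can be pictured in plain MATLAB. This is only a sketch of the idea and not the GEM-based nodup: the sample rule string is hypothetical, and the code assumes deblanked rules with no semicolon inside a quoted symbol.

```matlab
% Hypothetical output of nostar1 for a grammar that uses a* twice and b* once.
rules = 'A=aA;A=;B=bB;B=;A=aA;A=;';

% Split the text into individual rules, each ending in ';'.
% Assumption: no ';' appears inside a quoted input or output symbol.
parts = regexp(rules, '[^;]*;', 'match');

% Keep the first occurrence of each rule and preserve the original order.
kept = unique(parts, 'stable');

% Glue the surviving rules back together.
deduped = [kept{:}];
fprintf('%s\n', deduped);
```

The six rules shrink to the four distinct ones, so backtracking never has to revisit an identical alternative.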

The r* items are replaced in the grammar by IOG nostar2, a version of pretty containing 26 rules of the form

s = 'a*' "A";
s = 'b*' "B";
...
s = 'z*' "Z";

Both nostar1 and nostar2 require deblanked input.
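
The two-step transformation can be imitated with ordinary string functions. The sketch below is not how nostar1 and nostar2 are written (they are IOGs run by GEM). The sample grammar is hypothetical, the order of concatenation is an assumption, and the code ignores the possibility of a quoted '*' symbol, which a real grammar such as postfix would contain.

```matlab
% Hypothetical deblanked grammar that stars the phrase names a and b.
g = 'g=a*b*a*;';

% Find every lower-case name that is followed by '*', without repeats.
starred = unique(regexp(g, '[a-z](?=\*)', 'match'));

body = g;        % plays the role of nostar2: grammar with r* replaced by R
newRules = '';   % plays the role of nostar1 (plus nodup): the added rules
for k = 1:numel(starred)
    r = starred{k};
    R = upper(r);
    body = strrep(body, [r '*'], R);                             % r*  ->  R
    newRules = [newRules, sprintf('%s=%s%s;%s=;', R, r, R, R)];  % R=rR;R=;
end

% Concatenate the two results into one star-free grammar.
starFree = [body, newRules];
fprintf('%s\n', starFree);
```

For this input the result is g=ABA;A=aA;A=;B=bB;B=; which contains no * and only the original IOG rule forms.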

4.1.2 A new version of pretty.

To keep the ability to display IOGs, a new version of pretty is needed to handle multicharacter operators and the Kleene * and +. It is convenient to use *, in particular, to describe
sequences of white space. Here is the new pretty.

4.2 Using * in postfix.

The postfix grammar in the previous chapter used a functional-programming technique to achieve left-associative expression
syntax; the grammar could instead have been written using the Kleene *. Here it is:

As one can see, the Kleene * in postfix0 has been transformed away. Applying the newly generated IOG gives postfix expressions as before.

fprintf('%s\n', GEM('2*(6+a+4)-2/7', postfix, 'DL'));

26a+4+*27/-
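
As an independent cross-check of that output, the same infix expression can be converted with a conventional operator-stack (shunting-yard) loop. This is a different technique from the grammar-driven translation used by GEM, shown only to confirm the expected string. It assumes single-character operands, the four operators + - * /, and left associativity. The names prec, stack and out are local to this sketch.

```matlab
expr  = '2*(6+a+4)-2/7';
prec  = @(op) 1 + any(op == '*/');   % precedence: 1 for + and -, 2 for * and /
out   = '';                          % postfix output built so far
stack = '';                          % operator stack, top at the end

for c = expr
    if any(c == '+-*/')
        % Left associativity: pop operators of equal or higher precedence.
        while ~isempty(stack) && stack(end) ~= '(' && prec(stack(end)) >= prec(c)
            out(end+1) = stack(end);
            stack(end) = [];
        end
        stack(end+1) = c;
    elseif c == '('
        stack(end+1) = c;
    elseif c == ')'
        % Pop back to the matching '(' and discard it.
        while stack(end) ~= '('
            out(end+1) = stack(end);
            stack(end) = [];
        end
        stack(end) = [];
    else
        out(end+1) = c;              % single-character operand
    end
end

out = [out, fliplr(stack)];          % flush the remaining operators
fprintf('%s\n', out);
```

The loop prints 26a+4+*27/-, in agreement with the result produced by the generated IOG.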

4.3 Implementing Kleene +

Adding the Kleene operator + could be dealt with in much the same way as *. A simpler solution is to translate r+ into rr*, then use nostar1 and nostar2 to eliminate the *. The IOG noplus is a version of pretty with 26 rules, one for each lower-case letter, that perform this translation.
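
The rewrite of r+ into rr* is a one-line substitution when done with a regular expression. The sketch below only illustrates the effect on a hypothetical deblanked grammar. It is not the noplus IOG, and it would also rewrite a letter followed by a quoted '+' symbol, which a real grammar must protect.

```matlab
% Hypothetical deblanked grammar using both + and *.
g = 'g=d+e*;';

% Replace each lower-case name followed by '+' with the name twice and a '*'.
noPlus = regexprep(g, '([a-z])\+', '$1$1*');
fprintf('%s\n', noPlus);
```

The result, g=dd*e*;, contains only *, so the star-elimination step can be applied next.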

4.4 Grammar introspection

Grammars can be used to examine grammars. In particular, the sets Vn, Vg, Vi and Vo (see chapter 1) can be extracted. The Vn
extractor, for example, copies all of the LHS names and discards the rest. Function nodup removes the duplicates. Here Vn is used on itself.
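
What the Vn extraction computes can be sketched in a few lines of plain MATLAB. This is an illustration of the result, not the Vn grammar itself. The sample IOG is hypothetical, and the code assumes deblanked input, single-character phrase names, and no semicolon inside a quoted symbol.

```matlab
% Hypothetical deblanked IOG.
g = 'g=ABA;A=aA;A=;B=bB;B=;';

% Split into rules, each ending in ';'.
parts = regexp(g, '[^;]*;', 'match');

% Copy the LHS name (the first character of each rule), discard the rest.
lhs = cellfun(@(r) r(1), parts);

% Remove duplicates while keeping first-seen order (the job nodup does).
Vn = unique(lhs, 'stable');
fprintf('%s\n', Vn);
```

For this grammar lhs is gAABB and Vn is gAB.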

The sets Vi and Vo are less interesting, because the use of the built-in character classes makes the sets large, and because
the use of multi-character symbols makes the output hard to read. The use of the built-in character classes is not recorded
in the IOGs, but rather in the call of G.run. And multi-character input and output symbols are not implemented in Vi or Vo.
The choice here is to extract only the symbols explicitly defined in the IOG.