super secret hq

A Formal Language For Grammars

August 30, 2012

In the previous article, we described salc, a simple calculator. The description was presented very informally. In this article, we’re going to look at how describe the language formally. By formally, we mean that we’ll give a set of rules to identify valid programs. Those rules define the grammar for the language.

The Basic Rules

Before we go into those rules, we need to understand a few definitions and the rules behind the rules.

A letter is any upper- or lower-case letter.

A digit is any single-digit number.

A string is any text surrounded by quote marks. The string can use either single or double quotes, but the same mark must start and end the string.

A symbol is any item that is not a letter, digit, or string.

Spaces are ignored.

The plus sign indicates concatenation of items.

The vertical bar indicates choice between items.

Items enclosed by square brackets may be repeated zero or one times.

Items enclosed by curly brackets may be repeated zero, one or more times.

Items enclosed by parentheses are grouped together as a single item.

Now that we have the foundation, let’s mix metaphors and dive into the rules for grammars.

Concatenation

Concatenation means joining two strings together, one after the other. For example, concatenating “foo” with “bar” gives us the string “foobar.” In our rules, the concatenation operator is the plus sign. It takes two operands, one in front and behind, just like the plus sign does in math. Don’t make the mistake of thinking that concatenation is like adding. 1 + 2 is 3 when added, but 1 + 2 is 12 when concatenated.

Repeating

Repetition is a form of concatenation. When an item is enclosed by curly brackets, we may repeat it any number of times, from zero to infinity. An item that is repeated zero times results in an empty string. For example,

1 + {2} + 3

Would result in 13 if we repeated 2 zero times, 123 if we repeated 2 once and 12223 if we repeated 2 three times.

When an item is enclosed by square brackets, it is optional. An optional item can be repeated zero or one times. It’s actually a short-cut for “choose either the item or the empty string.” For example,

1 + [2] + 3

Can result in either 13 or 123. Note that it can’t be 1223 since we can repeat the 2 at most one time.

Grouping

Parentheses group items together. We do this for convenience and to override the evaluation order of the operators.

In algebra, 1 + 2 * 3 equals 7, not 9. That’s because multiplication has a higher priority than addition, so we multiply first. In our rules, choice is the highest priority, followed by repetition then concatenation. For example,

1 + 2 | 3 + 4

We would choose from 2 or 3 first, then do the concatenation. Our results could be 124 or 134.

With parentheses, we can make this explicit:

1 + ( 2 | 3 ) + 4

That is exactly the same as 1 + 2 | 3 + 4. If we wanted to change the order of evaluation, we would use parentheses to group the concatenations:

( 1 + 2 ) | ( 3 + 4 )

This now results in either 12 or 34.

We can nest items in many levels of parentheses. We must always evaluate from the innermost level.

1 + ( 2 | ( ( 3 | 4 ) | ( 5 | 6 ) ) )

We would choose 3 or 4 first, then 5 or 6, and then from the results of those two choices before choosing 2 or the result. The following results are all valid:

12
13
14
15
16

Strings

Now that that’s out of the way, we need to discuss a couple more things about strings.

Previously, we said that a string is any text surrounded by quote marks. The quote marks are used as delimiters for the string, so the mark starting the string must match the mark ending the string. If your text is

Hello, world!

Then the string could be either “Hello, world!” or ‘Hello, world!’ You just have to be sure to use the same type on both ends of the string.

What happens if the text includes a quote mark? Sometimes we can just use the other type as our delimiter. For example, the text

He's my brother.

has a single quote in it. If we use double quotes as delimiters, there’s no problem. We can’t always do that, though. This next text has both:

Tom said, "He's my brother."

We have to escape the mark that is used to delimit the text everywhere it appears in the text. To do that, we just double it up, so the string looks like this:

"Tom said, ""He's my brother."""

or this:

'Tom said, "He''s my brother."'

The second example is harder to read in some browsers since two single quotes together can look like a double quote.

Note that inside a string, only the two delimiters together are escaped. If the text happens to contain this combination, don’t panic. Just concentrate on the delimiter and it will work out. The text

Foo""Bar

becomes one of the following strings:

'Foo""Bar'
"Foo""""Bar"

Another alternative is to use concatenation:

"Foo" + '"' + '"' + "Bar"

This works because strings are just a convenient way to express concatenation.

Concatenation Redo

To save space and make it easier to read our grammars, we’re going to change the concatenation operator from the plus sign to a space (actually, zero or more spaces).

A + B + C

becomes

A B C

The Grammar

Our grammars are adapted from rules presented by Wirth in a letter to the ACM:

A grammar defines all the sentences in a language that are valid. In our case, since we’re building interpreters, these will be the statements that are syntactically valid. Statements can be syntactically valid and at the same time be semantically invalid. We’ll cover that in a later article. For now, we’re going to concentrate on recognizing syntactically valid statements.

According to our rules, a grammar consists of productions. They’re called “production” because they’re used to produce valid sentences.

Productions consist of an identifier followed by an equal sign followed by an expression. From that, we can build any context-free language.

Displaying Results

Let’s try building a grammar that can display results. We’ll use the following examples as test cases for the grammar:

1 + 2 =
1 = + = 2 =
=

The rules for addition haven’t changed, so we will start with the same grammar.

With no more input, we can’t match the symbol for digit. It’s required, so this test case is not valid. It should be so we must fix the grammar. Since including it as part of the expression didn’t work, let’s try putting it at the same level as the expression.

That’s accepted, so the test passes. There is a problem, though. We haven’t discussed it but grammars should be unambiguous. There should only be one valid way to accept input. Our rules do that for the expression, but not for the display. With the rules above, there are at least two ways to accept the input: