Context-free grammar

An equivalent notation, Backus-Naur Form, is often used for describing the syntax/grammar of programming languages.

A context-free grammar is a formal grammatical system for describing a particular class of language. The basic idea of such a grammar is to provide a set of rules which can be used to generate all possible grammatical 'sentences'; these rules must also avoid generating any ungrammatical sentences. A context-free grammar is not powerful enough to describe a natural language such as English, but it is powerful enough to describe nearly all programming languages.

Example of a very simple context-free grammar

This very simple example can handle only a tiny subset of English and should not be taken seriously as a guide to English grammar or vocabulary. It is based on an example by David Crystal[2] and is a slight extension of an example on slide 9 of this lecture.[3]

Sentence -> NounPhrase VerbPhrase    # A Sentence consists of a NounPhrase followed by a VerbPhrase
NounPhrase -> Determiner Noun        # A NounPhrase consists of a Determiner followed by a Noun
VerbPhrase -> Verb NounPhrase        # A VerbPhrase consists of a Verb followed by a NounPhrase
Determiner -> a, the                 # A Determiner is one of: a or the
Noun -> boy, girl, dog               # A Noun is one of: boy or girl or dog
Verb -> chased, heard, saw           # A Verb is one of: chased or heard or saw

Notation used above:

The symbol "->" means "can be replaced by".

When there are several items on the right hand side separated by commas, just one of these items can be selected.

The comments preceded by "#" are not part of the generative grammar, but hopefully will help folks understand the example a little better.

By making appropriate substitutions we can generate such sentences as:

The dog chased a girl.

The girl heard the dog.

The boy saw a girl.

The stages in the generation of these examples are as follows:

Sentence

-> NounPhrase VerbPhrase

-> Determiner Noun Verb NounPhrase

-> Determiner Noun Verb Determiner Noun

By appropriate selection of words to substitute for Determiner, Noun, and Verb we can produce the 3 example sentences above.

The simple grammar defines a few words which can be substituted for Determiner, Noun, and Verb. There are two possibilities for Determiner, three for Noun, and three for Verb, with both Determiner and Noun appearing twice in the expansion. Hence the total number of grammatical sentences which can be generated by this simple grammar is 108 (= 2 * 3 * 3 * 2 * 3). In this particular case the possible words have been carefully chosen so that all of these 108 sentences also actually mean something.
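The counting argument above can be checked mechanically. The following is a minimal Python sketch (the vocabulary is hard-coded from the example grammar; all variable names are illustrative) which enumerates every sentence the simple grammar can generate:

```python
from itertools import product

# Terminal items from the example grammar.
determiners = ["a", "the"]
nouns = ["boy", "girl", "dog"]
verbs = ["chased", "heard", "saw"]

# A Sentence ultimately expands to: Determiner Noun Verb Determiner Noun,
# so every grammatical sentence is one choice from each of the five slots.
sentences = [
    " ".join(words)
    for words in product(determiners, nouns, verbs, determiners, nouns)
]

print(len(sentences))   # 108
print(sentences[0])     # a boy chased a boy
```

The enumeration confirms the count 2 * 3 * 3 * 2 * 3 = 108.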

Note that the use of italics and boldface in this example is merely a convenient way of indicating the difference between items which can be expanded further (e.g. concepts such as NounPhrase) and items which can not be expanded further (e.g. words such as girl). In the linguistic literature, items which can be expanded further are referred to as 'non-terminal' items, while items which can't be expanded further are known as 'terminal' items.

The 'comma' notation for alternatives is a form of shorthand. In some formulations this would be shown as several separate rules, so that:

Determiner -> a, the

could be shown as two rules:

Determiner -> a
Determiner -> the

Caution

While a generative grammar may produce sentences which are grammatically correct, there is no certainty that such sentences will actually mean anything. The classic example of such a meaningless sentence is from Chomsky's book,[4] with the grammatically correct sentence:

"Colorless green ideas sleep furiously."

This example illustrates the difference between grammar/syntax and meaning/semantics.

Another problem when analysing a sentence is that many English words have multiple meanings, so that there may be more than one way to interpret a sentence. A simple example is the sentence:

"Time flies."

Is this an instruction to check how fast flies are flying, or an indication that time appears to be passing quickly?

A varied and entertaining selection of such ambiguities has been compiled by Jeff Gray.[5]

Formal grammar

Note: this section and subsequent sections are, by necessity, slightly more technical/formal than the earlier parts of this article, though specialist notation and abbreviations have been largely avoided. Make sure that you understand the 'Simple example' above before proceeding further, since references will be made to that example to help explain the technicalities, and variations on that example will be used to illustrate new ideas.

A formal generative grammar has four components (as described in slide 19 of this lecture[6]):

a finite number of non-terminal concepts (e.g. VerbPhrase in the example, or anything else in italics).

a finite number of terminal items (e.g. girl in the example, or anything else in bold-face).

a finite set of rules for transforming non-terminal items (e.g. the example has 6 rules).

a single starting point (e.g. Sentence in the example).

Notes:

Terminal items are those which can not be transformed into anything else.

A list of all the terminal items is effectively the vocabulary from which sentences are built.

Non-terminal items are those which can be transformed into something else.

Linguists have developed various classes of grammars by using constraints to govern what sort of rules can be used for any particular grammar. The constraints for context-free grammars are detailed in a later section.

Changing the rules

A relatively small change to even a single rule can make a large change in the language which can be generated. For example, change the first rule in the simple example to read:

Sentence -> NounPhrase VerbPhrase, NounPhrase VerbPhrase and NounPhrase VerbPhrase

This means that a Sentence is either a NounPhrase followed by a VerbPhrase, or a NounPhrase followed by a VerbPhrase followed by and followed by a NounPhrase followed by a VerbPhrase. In addition to the shorter sentences already discussed, this could also generate a sentence such as: "A dog saw a girl and the dog chased the girl". The total number of grammatically-correct sentences which can be generated by this modified grammar is now 11,772 (108 + 108*108), though some of these may not be very meaningful.

Towards infinity

A significantly different rule change can have an astounding effect. Consider changing the first rule to read:

Sentence -> NounPhrase VerbPhrase, NounPhrase VerbPhrase and Sentence

This can generate a potentially infinite number of sentences (any number of these five-word sentences linked by and). This is an example of what is known as a recursive definition, whereby a Sentence is partially defined in terms of itself. In such a recursive rule there must be at least one transformation option which does not involve Sentence, otherwise the transformation could never stop.
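The recursive rule can be sketched directly in Python, since a recursive rule maps naturally onto a recursive function. In this illustrative sketch (the function names and the stopping probability are implementation choices, not part of the grammar), the non-recursive option is what guarantees termination:

```python
import random

# Terminal items from the simple grammar.
DETERMINERS = ["a", "the"]
NOUNS = ["boy", "girl", "dog"]
VERBS = ["chased", "heard", "saw"]

def noun_phrase(rng):
    # NounPhrase -> Determiner Noun
    return f"{rng.choice(DETERMINERS)} {rng.choice(NOUNS)}"

def clause(rng):
    # NounPhrase VerbPhrase, where VerbPhrase -> Verb NounPhrase
    return f"{noun_phrase(rng)} {rng.choice(VERBS)} {noun_phrase(rng)}"

def sentence(rng, stop_probability=0.5):
    # Sentence -> NounPhrase VerbPhrase
    #           | NounPhrase VerbPhrase "and" Sentence   (right-recursive)
    # The first, non-recursive option ensures the recursion eventually stops.
    if rng.random() < stop_probability:
        return clause(rng)
    return f"{clause(rng)} and {sentence(rng)}"

rng = random.Random(0)
print(sentence(rng))
```

Every generated sentence is a chain of five-word clauses joined by "and", and the chain can in principle be arbitrarily long.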

The above change is known as right-recursive; a left-recursive formulation is also possible, and in this example generates exactly the same sentences:

Sentence -> NounPhrase VerbPhrase, Sentence and NounPhrase VerbPhrase

The distinction between left- and right-recursive grammars becomes important when trying to parse/analyse a sentence; some types of parser work best with left-recursive grammars, while other types work best with right-recursive grammars. This distinction is particularly relevant in connection with compiling or translating a program written in some high-level programming language so that it can be used on a computer.

The use of left-recursive or right-recursive rules may also affect the meaning, as shown in the 'arithmetic expression' example below.

Constraints for a context-free grammar

The transformation rules for context-free grammars are fairly simple:

Only a single non-terminal item can appear on the left-hand side of a rule.

The right-hand side of any rule must be one of:

[nothing].

any combination of terminal items and non-terminal items.

In the linguistic literature, [nothing] is often represented by the symbol ε.
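These constraints are simple enough to check mechanically. The sketch below assumes a hypothetical rule representation — a pair of a left-hand symbol and a (possibly empty) tuple of right-hand symbols — and is only an illustration of the two constraints, not part of any standard library:

```python
# Non-terminal items of the simple example grammar.
NONTERMINALS = {"Sentence", "NounPhrase", "VerbPhrase",
                "Determiner", "Noun", "Verb"}

def is_context_free(rules):
    """Check the context-free constraints: a single non-terminal on the
    left-hand side; the right-hand side may be empty (epsilon) or any
    combination of terminal and non-terminal items."""
    for lhs, rhs in rules:
        if lhs not in NONTERMINALS:   # left side must be one non-terminal
            return False
        # No further restriction on rhs for a context-free grammar.
    return True

rules = [
    ("Sentence", ("NounPhrase", "VerbPhrase")),
    ("Determiner", ("a",)),
    ("Determiner", ()),               # an epsilon rule is allowed
]
print(is_context_free(rules))         # True
```

Note that the epsilon rule shown is allowed by the constraints even though the simple example grammar happens not to use one.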

Examples

Arithmetic expressions

The following example shows how context-free grammars can cope with arithmetic expressions involving possibly nested brackets. An expression such as "2 + 3 * 4" could be interpreted either as "(2 + 3) * 4" = 20 or as "2 + (3 * 4)" = 14. The latter interpretation is the one normally used - multiplication "*" is said to have a higher priority than addition "+" - so the grammar must incorporate this priority concept. Furthermore, consecutive operators of the same priority are normally applied from left to right, so that "8 - 2 + 3" is interpreted as "(8 - 2) + 3" = 9, rather than as "8 - (2 + 3)" = 3. This ordering must also be built into the grammar.

To avoid distracting side-issues, this example only deals with numbers which consist solely of a single digit. The example is a simplified version of the definition of an arithmetic expression given in the Revised Algol 60 Report (Section 3.3.1).[7]

Expression -> Term                      # an Expression is either a Term
Expression -> Expression AddSubOp Term  # or an Expression followed by + or - followed by a Term
#
Term -> Primary                         # a Term is either a Primary
Term -> Term MulDivOp Primary           # or a Term followed by * or / followed by a Primary
#
Primary -> Number                       # a Primary is either a Number
Primary -> ( Expression )               # or an Expression enclosed in brackets
#
MulDivOp -> *, /                        # multiplication and division have the same priority
AddSubOp -> +, -                        # addition and subtraction have the same priority
Number -> 0, 1, ... 9                   # this simple-minded definition of a Number is enough for this example

Notes:

Expression and Term are each defined via a left-recursive rule; this ensures that operators of the same priority are applied from left to right.

The distinction between Expression and Term ensures that multiplication/division are applied before addition/subtraction.

Expression is indirectly recursive via Term and Primary to allow bracketed sub-expressions to any depth.
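The notes above can be illustrated with a small evaluator for this grammar. A naive recursive-descent parser cannot handle left-recursive rules directly, so the sketch below (a minimal illustration, not a full parser; it does no error handling) rewrites the two left-recursive rules as loops, which preserves the left-to-right application of equal-priority operators:

```python
def evaluate(text):
    """Evaluate a single-digit arithmetic expression using the grammar
    above, with the left-recursive rules rewritten as loops."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expression():
        # Expression -> Term (AddSubOp Term)*   [loop form of left recursion]
        value = term()
        while peek() in ("+", "-"):
            op = take()
            value = value + term() if op == "+" else value - term()
        return value

    def term():
        # Term -> Primary (MulDivOp Primary)*   [loop form of left recursion]
        value = primary()
        while peek() in ("*", "/"):
            op = take()
            value = value * primary() if op == "*" else value / primary()
        return value

    def primary():
        # Primary -> Number | ( Expression )
        if peek() == "(":
            take()                  # consume "("
            value = expression()
            take()                  # consume ")"
            return value
        return int(take())          # a single digit

    return expression()

print(evaluate("2 + 3 * 4"))    # 14  (multiplication binds more tightly)
print(evaluate("(2 + 3) * 4"))  # 20  (brackets override priority)
print(evaluate("8 - 2 + 3"))    # 9   (left to right for equal priority)
```

Because expression calls term, and term calls primary, which may call expression again for a bracketed sub-expression, the code mirrors the indirect recursion of the grammar and handles brackets nested to any depth.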