We saw in the preceding classes that regular languages are
surprisingly important, despite the simplicity of
DFAs/NDFAs/RegExps. There are many languages that are clearly
important to us (like properly formatted decimal numbers) that
in fact regular. However, we also proved that there are many
languages that are fundamental to computing that are not
regular --- i.e. DFAs/NDFAs/RegExps are not powerful enough to
capture. For instance, we proved that with regular languages
we can't even require that parentheses match up in a simple
language for arithmetic expressions. So finite automata (and
regular expressions) are not powerful enough even for the core
activities of compiling and interpreting programs. This
motivates us to consider new, more powerful formalisms.
First this will mean "context free grammars", and a bit later
this will mean pushdown automata.

Context Free Grammars: Informally

A context free grammar is a set of "rewrite rules"
that say that a particular string can be "rewritten" by
substituting particular expressions in place of particular
symbols. A simple example: We may have a rule
stating that the character A can be replaced with the
string aaA (which we'll write as A → aaA).
With this rule, we can "rewrite" the string babAb as
babaaAb. We distinguish between symbols for which
there are no substitution rules (called terminals) and
those for which there are subsitution rules (called
nonterminals). The goal in using these rules is to
produce a string of terminals - after all, at that point no
further substitutions are possible.

Example:
Given the rules A → aaA and A → b
we can produce the string aaaab from the string
A as follows:

start with A

apply the first rule and get aaA

apply the first rule and get aaaaA

apply the second rule and get aaaab

We say that we have derived string aaaab from the string
A.

A collection of these rules along with a nonterminal
designated as the string with which derivations begin is a grammar.
When all rules have the form X → w, where
X is a single nonterminal, the grammar is called
Context Free.

Example: The following rules, with nonterminal
S designated as the start symbol, define the
language of all even-length strings over the alphabet
{a,b}.

S → A

A → aaA

A → abA

A → baA

A → bbA

A → λ

A derivation of abaabb, for example, would be

S ⇒ A ⇒ abA ⇒ abaaA ⇒ abaabbA ⇒ abaabb

where the "⇒" denotes the application of a single
grammar rule.

Life gets substantially more interesting when grammar
derivations produce strings containing several nonterminals,
and during the derivation process you have many choices as to
which rules to apply.

Example: Consider the following grammar:

S → A

A → A,A

A → c

In derivations for this grammar we have decisions - decisions
about which non-terminal to replace as well as which rule to
use in the replacement. For example, the string c,c,c
can be derived in several different ways:

CFGs are kind of like regular expressions, in the sense that
they are a nice way to specify a language, as opposed to
testing whether a string belongs to a language. So CFGs are
similar to regular expressions, but how do they compare?
Can they express everything that regular exrpessions can?

Excercise:
Here's a list of some of the languages defined by regular
expressions.
See if you can write a CFG generating any of them.

(a|b)*, i.e. any string over {a,b}

a(a|b)*b, i.e. any string that starts with
a and ends with b

abb|bba, i.e. {abb,bba}

Can
they express languages that cannot be expressed by regular
expressions?

Excercise:
Here's a list of some of the languages we proved are not
regular, i.e. cannot be accepted by finite automata.
See if you can write a CFG generating any of them.

{anbn | n ≥ 0}

{anbncn | n ≥ 0}

The language of properly parenthesized lists of a's

The language of palindromes over {a,b,c}

The language of all strings of a's whose
length is prime

A formal definition of a CFG

Just as with DFAs and NDFAs we start with an informal
definition of CFG, but we eventually need a formal definition
so that we may begin to write algorithms and ultimately proofs.
Formally:

Definition: A Context Free Grammar is a 4-tuple
G = (Σ,NT,R,S), where

Σ is the alphabet, i.e. the set of terminal
symbols

NT is the set of nonterminal symbols

R ⊆ NT × (Σ ∪ NT)*,
i.e. the elements of R are pairs whose first
componant is an element of NT and whose second
componant is a string over the alphabet Σ ∪ NT