Regular Expressions and
Automata
Lecture #2-2
September 10
2009
1
Finite State Automata
• Regular Expressions (REs) can be viewed as a way
to describe machines called Finite State Automata
(FSA, also known as automata, finite automata).
• FSAs and their close variants are a theoretical
foundation of much of the field of NLP.
2
Finite State Automata
• FSAs recognize the regular languages represented
by regular expressions
– SheepTalk: /baa+!/
a
b
q0
a
q1
a
q2
!
q3
q4
• Directed graph with labeled nodes and arc transitions
•Five states: q0 the start state, q4 the final state, 5
transitions
3
Formally
• FSA is a 5-tuple consisting of
–
–
–
–
–
Q: set of states {q0,q1,q2,q3,q4}
: a finite alphabet of symbols {a,b,!}
q0: a start state
F: a set of accept/final states in Q {q4}
(q,i): a transition function mapping Q x  to Q
a
b
q0
a
q1
a
q2
!
q3
q4
4
State Transition Table for
SheepTalk
Input
State
b
a
!
0
1
Ø
Ø
1
Ø
2
Ø
2
Ø
3
Ø
3
Ø
3
4
4
Ø
Ø
Ø
5
Recognition
• Recognition (or acceptance) is the process of
determining whether or not a given input should be
accepted by a given machine.
• Or… it’s the process of determining if as string is in
the language we’re defining with the machine
• In terms of REs, it’s the process of determining
whether or not a given input matches a particular
regular expression.
• Traditionally, recognition is viewed as processing an
input written on a tape consisting of cells containing
elements from the alphabet.
6
• FSA recognizes (accepts) strings of a regular
language
–
–
–
–
baa!
baaa!
baaa!
…
• Tape metaphor: a rejected input
q0
a
b
a
!
b
7
Recognition
•
•
•
•
•
Simply a process of starting in the start state
Examining the current input
Consulting the table
Going to a new state and updating the tape pointer.
Until you run out of tape.
8
D-Recognize
9
q0
q1
q2
b
q3
a
q3
a
q3
a
a
q4
!
Input
State
b
a
!
0
1
Ø
Ø
1
Ø
2
Ø
2
Ø
3
Ø
3
Ø
3
4
4
Ø
Ø
Ø
10
Key Points
• Deterministic means that at each point in processing
there is always one unique thing to do (no choices).
• D-recognize is a simple table-driven interpreter
• The algorithm is universal for all unambiguous
languages.
– To change the machine, you change the table.
11
Key Points
• Crudely therefore… matching strings with regular
expressions (ala Perl) is a matter of
– translating the expression into a machine (table) and
– passing the table to an interpreter
12
Recognition as Search
• You can view this algorithm as a degenerate kind of
state-space search.
• States are pairings of tape positions and state
numbers.
• Operators are compiled into the table
• Goal state is a pairing with the end of tape position
and a final accept state
• Its degenerate because?
13
Formal Languages
• Formal Languages are sets of strings composed of
symbols from a finite set of symbols.
• Finite-state automate define formal languages
(without having to enumerate all the strings in the
language)
• Given a machine m (such as a particular FSA) L(m)
means the formal language characterized by m.
– L(Sheeptalk FSA) = {baa!, baaa!, baaaa!, …} (an infinite set)
14
Generative Formalisms
• The term Generative is based on the view that you
can run the machine as a generator to get strings
from the language.
• FSAs can be viewed from two perspectives:
– Acceptors that can tell you if a string is in the language
– Generators to produce all and only the strings in the
language
15
Three Views
• Three equivalent formal ways to look at what we’re
up to (not including tables – and we’ll find more…)
Regular Expressions
Finite State Automata
Regular Languages
16
Determinism
• Let’s take another look at what is going on with drecognize.
• In particular, let’s look at what it means to be
deterministic here and see if we can relax that notion.
• How would our recognition algorithm change?
• What would it mean for the accepted language?
17
Determinism and Non-Determinism
• Deterministic: There is at most one transition that
can be taken given a current state and input symbol.
• Non-deterministic: There is a choice of several
transitions that can be taken given a current state
and input symbol. (The machine doesn’t specify how
to make the choice.)
18
Non-Deterministic FSAs for
SheepTalk
b
q0
a
q1
b
q0
a
q2
a
q1
a
!
q3
a
q2
q4
!
q3
q4

19
FSAs as Grammars for Natural
Language
dr
the
q0
rev
q1
q2
hon

mr
pat
q3
l.
q4
robinson
q5
q6
ms
mrs

Can you use a regexpr to capture this too?
20
Equivalence
• Non-deterministic machines can be converted to
deterministic ones with a fairly simple construction
(essentially building “set states” that are reached by
following all possible states in parallel)
• That means that they have the same power; nondeterministic machines are not more powerful than
deterministic ones
• It also means that one way to do recognition with a nondeterministic machine is to turn it into a deterministic
one.
• Problems: translating gives us a not very intuitive
machine, and this machine has LOTS of states
21
Non-Deterministic Recognition
• In a ND FSA there exists at least one path directed
through the machine by a string that is in the
language defined by the machine that leads to an
accept condition..
• But not all paths directed through the machine by an
accept string lead to an accept state. It is OK for
some paths to lead to a reject condition.
• In a ND FSA no path directed through the machine
by a string outside the language leads to an accept
condition.
22
Non-Deterministic Recognition
• So success in a non-deterministic recognition occurs
when a path is found through the machine that ends
in an accept.
• However, being driven to a reject condition by an
input does not imply it should be rejected.
• Failure occurs only when none of the possible paths
lead to an accept state.
• This means that the problem of non-deterministic
recognition can be thought of as a standard search
problem.
23
The Problem of Choice
• Choice in non-deterministic models comes up again
and again in NLP.
Several Standard Solutions
• Backup (search, this chapter)
– Save input/state of machine at choice points
– If wrong choice, use this saved state to back up and try
another choice
• Lookahead
– Look ahead in the input to help make a choice
• Parallelism
– Look at all choices in parallel
24
Backup
• After a wrong choice leads to a dead-end (either no
input left in a non-accept state, or no legal
transitions), return to a previous choice point to
pursue another unexplored choice.
• Thus, at each choice point, the search process needs
to remember the (unexplored) choices.
• Standard State Space Search.
• State = (FSA node or machine state, tape-position)
25
Example
b
q0
a
q1
a
a
q2
q2
!
q3
\
q4
26
ND-Recognize Code
27
Example
Agenda:
28
Example
29
Example
Agenda:
30
Example
31
Example
Agenda:
32
Example
33
Example
Agenda:
34
Example
Agenda:
35
Example
Agenda:
36
Example
37
Example
Agenda:
38
Example
Agenda:
39
Example
Agenda:
40
Example
Agenda:
41
Key Points
• States in the search space are pairings of tape
positions and states in the machine.
• By keeping track of as yet unexplored states, a
recognizer can systematically explore all the paths
through the machine given an input.
42
Infinite Search
• If you’re not careful such searches can go into an
infinite loop.
• How?
43
Why Bother?
• Non-determinism doesn’t get us more formal power
and it causes headaches so why bother?
– More natural solutions
– Machines based on construction are too big
44
Compositional Machines
• Formal languages are just sets of strings
• Therefore, we can talk about various set operations
(intersection, union, concatenation)
• This turns out to be a useful exercise
45
Union
• Accept a string in either of
two languages
46
Concatenation
• Accept a string consisting of a string from language L1
followed by a string from language L2.
47
Negation
• Construct a machine M2 to accept all strings not
accepted by machine M1 and reject all the strings
accepted by M1
– Invert all the accept and not accept states in M1
• Does that work for non-deterministic machines?
48
Intersection
• Accept a string that is in both of two specified
languages
• An indirect construction…
– A^B = ~(~A or ~B)
49
Why Bother?
• ‘FSAs can be useful tools for recognizing – and
generating – subsets of natural language
– But they cannot represent all NL phenomena (Center
Embedding: The mouse the cat ... chased died.)
50
Summing Up
• Regular expressions and FSAs can represent
subsets of natural language as well as regular
languages
– Both representations may be impossible for humans to
understand for any real subset of a language
– But they are very easy to use for smaller subsets
• Next time: Read Ch 3
• For fun:
– Think of ways you might characterize features of your email
using only regular expressions
51