Kleene’s Theorem

Stephen Cole Kleene was an American mathematician whose groundbreaking work in the sub-field of logic known as recursion theory laid the groundwork for modern computing. While most computer programmers might not know his name or the significance of his work regarding computable functions, I am willing to bet that anyone who has ever dealt with regular expressions is intimately familiar with an indispensable operator that resulted directly from his work and even bears his name: the *, or as it is formally known, the Kleene star.

While his contributions to computer science in general cannot be overstated, Kleene also authored a theorem that plays an important role in artificial intelligence, specifically the branch known as natural language processing, or NLP for short. Kleene’s Theorem relates regular languages, regular expressions, and finite state automata (FSAs). In short, he proved that regular expressions and finite state automata are equivalent in what they can describe: they are just two different representations of any given regular language.

Background

Strings

If you have not already done so, it might be worthwhile to read my post “Not String Theory – String Facts” about alphabets, strings and their relationship to formal languages.

Regular Languages

Informally, regular languages are defined as any language that can be represented by a regular expression, and by regular expression I mean in the strict formal sense, not including any of the ways modern programming languages have extended them. For our purposes here, however, we need to look at the formal definition of a regular language.

Given an alphabet Σ, the collection of regular languages over it can be defined recursively:

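The empty language Ø, i.e., the set containing no strings at all (not even the empty string), is a regular language.

The language containing just the empty string, {ε}, is a regular language.

For each symbol a in Σ, the singleton language {a} is a regular language.

If A and B are regular languages, then their union A ∪ B and their concatenation A • B are regular languages.

If A is a regular language, then the Kleene star of A, A*, is a regular language. (The same is true for B.)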

There are no other regular languages over Σ than those described above.
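To make the recursive definition a little more concrete, here is a minimal Python sketch that treats a language as a plain set of strings. The function names (union, concat, star) and the max_len cut-off are my own assumptions for illustration: A* is usually infinite, so the sketch only enumerates its strings up to a chosen length.

```python
def union(a, b):
    """A ∪ B: every string that is in A or in B."""
    return a | b


def concat(a, b):
    """A • B: every string formed by a string from A followed by one from B."""
    return {x + y for x in a for y in b}


def star(a, max_len):
    """The strings of A* that are no longer than max_len.

    A* is infinite whenever A contains a non-empty string, so this
    sketch truncates it (an assumption made purely for illustration).
    """
    result = {""}        # A* always contains the empty string
    frontier = {""}
    while frontier:
        # Extend every newly found string by one more string from A.
        frontier = {
            x + y for x in frontier for y in a if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


EMPTY = set()      # the empty language Ø
EPSILON = {""}     # the language {ε}
A = {"a"}          # a singleton language
B = {"b"}          # another singleton language

ab_or_b = union(concat(A, B), B)   # {"ab", "b"}
a_star = star(A, 3)                # {"", "a", "aa", "aaa"}
```

Note that star(EMPTY, 3) gives {""}: the Kleene star of the empty language is the language containing only the empty string, which is one reason Ø and {ε} have to be kept apart.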

Regular Expressions

Simply put, regular expressions are nothing more than strings of characters (letters, numbers, operators, quantifiers, anchors, and grouping symbols) that define a search pattern, which more often than not is used to match strings. The core regular expression syntax was developed by Stephen Kleene in 1956 as a way to describe regular languages.

The structural rules and application of regular expressions are well beyond the scope of this post, although a rudimentary understanding would certainly be useful for understanding what’s to come. If you are a computer programmer and have used regular expressions, or if you are otherwise familiar with them, it’s not too difficult to see the relationship between them and regular languages. Each regular expression defines a specific regular language.
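As a small illustration of a regular expression defining a language, the sketch below uses Python's re module but sticks to the formal operators only: union (|), concatenation, the Kleene star, and parentheses for grouping. The pattern and the helper name in_language are my own examples, not anything from Kleene's work. The pattern describes the language of all strings over {a, b} that end in abb.

```python
import re

# Only the formal operators are used here: union (|), concatenation,
# and the Kleene star (*), plus parentheses for grouping.
PATTERN = re.compile(r"(a|b)*abb")


def in_language(text):
    """True if the whole string belongs to the language of PATTERN."""
    # fullmatch requires the entire string to match, which is what
    # membership in a formal language means (no partial matches).
    return PATTERN.fullmatch(text) is not None


print(in_language("abb"))     # True
print(in_language("aababb"))  # True
print(in_language("ab"))      # False
```

Using fullmatch rather than search matters here: a string is either in the language or it is not, so a match somewhere in the middle of a longer string does not count.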

Finite State Automata

Finite state automata (FSAs), also commonly known as finite state machines (FSMs), are a way to model computation mathematically. A finite state automaton, as the name suggests, is made up of a finite number of discrete states, and it can be in only one of them at any given time. An FSA has a starting state and moves, or transitions, from one state to another based on the symbol it reads at each step as it iterates over the input. If the FSA is in one of its designated final states when the input runs out, it is successful and accepts the input. It is unsuccessful if the input runs out while it is in a non-final state, or if it reaches a state from which it can’t move because there is no matching transition for the next symbol.

An FSA can be thought of as a directed graph, where the states are the nodes and the edges represent the transitions, each labeled with the input that causes it.
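The graph view translates fairly directly into code. Below is a minimal sketch, assuming a dictionary that maps a (state, symbol) pair to the next state, so each entry is one labeled edge. The machine itself (binary strings with an even number of 1s) and all of the names are hypothetical examples of mine.

```python
# Each key is an edge of the graph: (current state, input symbol),
# and each value is the node that edge points to.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START = "even"
FINALS = {"even"}


def accepts(transitions, start, finals, text):
    """Run the automaton over text and report whether it accepts."""
    state = start
    for symbol in text:
        key = (state, symbol)
        if key not in transitions:
            return False  # stuck: no edge leaves this state for this symbol
        state = transitions[key]
    # Input is exhausted: succeed only if we stopped in a final state.
    return state in finals


print(accepts(TRANSITIONS, START, FINALS, "1001"))  # True
print(accepts(TRANSITIONS, START, FINALS, "10"))    # False
print(accepts(TRANSITIONS, START, FINALS, "12"))    # False
```

The last call fails for a different reason than the second one: "10" runs out of input in a non-final state, while "12" gets stuck because no transition exists for the symbol 2.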

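Formally, a finite state automaton is defined by 5 parameters. A finite state machine A is a quintuple (Σ, S, s0, δ, F), where:

Σ is the input alphabet, a finite, non-empty set of symbols.

S is a finite, non-empty set of states.

s0 is the starting state, an element of S.

δ is the transition function, δ : S × Σ → S.

F is the set of final states, a (possibly empty) subset of S.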

The above description is of a deterministic finite state automaton. This means that for each state there is only one transition for any specific input symbol: only one edge leaves a node for any given input symbol. Some FSAs may instead have states with multiple transitions (edges) for a given input symbol. These are known as non-deterministic finite state machines. The only difference in the definition above for non-deterministic FSAs is that the transition function δ now returns a set of states instead of just a single state, δ : S × Σ → P(S), where P(S) is the power set of S.
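Here is a rough sketch of what that change to δ looks like in practice. Instead of tracking one current state, the simulation tracks the set of all states the machine could be in. The example machine (strings over {a, b} that end in ab), the state names, and the function name nfa_accepts are assumptions of mine, and the sketch does not handle ε-transitions.

```python
# delta maps (state, symbol) to a SET of possible next states.
# From q0 on "a" the machine may stay in q0 or guess that the
# final "ab" has started and move to q1.
DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
START = "q0"
FINALS = {"q2"}


def nfa_accepts(delta, start, finals, text):
    """Accept if at least one possible run ends in a final state."""
    current = {start}
    for symbol in text:
        # Follow every available edge from every state we might be in;
        # a missing entry contributes no successor states.
        current = set().union(
            *(delta.get((state, symbol), set()) for state in current)
        )
    return bool(current & finals)


print(nfa_accepts(DELTA, START, FINALS, "aab"))  # True
print(nfa_accepts(DELTA, START, FINALS, "abb"))  # False
```

Tracking a set of states this way is essentially the idea behind the subset construction, which is the usual way to turn a non-deterministic machine into a deterministic one.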

With all of that covered, we can finally take a look at Kleene’s Theorem.

Kleene’s Theorem

Theorem (Kleene, 1956) The family of languages over Σ* that are regular is equal to the least family of languages over Σ* that contains the empty set, the singleton sets, and that is closed under star, concatenation, and union.

Huh? I admit that even now I have trouble translating that. So if you are, too, you’ll have to take my word for it that what the theorem means is:

A language over an alphabet is regular if and only if it can be accepted by a finite automaton. Given that regular expressions are representations of regular languages, this theorem has two implications:

Any regular language can be accepted by a finite state automaton. (There exists a finite state automaton for every regular expression.)

Any language accepted by a finite state automaton is regular. (There exists a regular expression for any language accepted by a finite state automaton.)
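To get a feel for these two implications, the sketch below pairs a regular expression with a hand-built deterministic automaton for the same language (strings over {a, b} ending in abb) and checks that they agree on every string up to a given length. The automaton, the names, and the length bound are all my own assumptions, and a bounded check like this is only supporting evidence for one example, not a proof of the theorem.

```python
import re
from itertools import product

PATTERN = re.compile(r"(a|b)*abb")

# Hand-built DFA for the same language. State n means "the last
# symbols read match the first n symbols of abb".
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START = 0
FINALS = {3}


def dfa_accepts(text):
    """Run the DFA over text; every (state, symbol) pair has an edge."""
    state = START
    for symbol in text:
        state = DELTA[(state, symbol)]
    return state in FINALS


def regex_accepts(text):
    """Membership test using the regular expression."""
    return PATTERN.fullmatch(text) is not None


def disagreements(max_len):
    """All strings over {a, b} up to max_len where the two differ."""
    found = []
    for length in range(max_len + 1):
        for letters in product("ab", repeat=length):
            text = "".join(letters)
            if dfa_accepts(text) != regex_accepts(text):
                found.append(text)
    return found


print(disagreements(8))  # [] (the two representations agree)
```

If you change a single edge in DELTA, the list stops being empty, which is a handy way to convince yourself that the check is actually doing something.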

Long story short, regular expressions and finite state automata are two sides of the same coin. The implications of this fact run far and deep but would require a post of their own to cover them in detail. In my next post, however, I will be proving Kleene’s Theorem. Look for it very soon.