CS 4344/7344 - Lecture 3

Beginning Parsing Techniques

What we get from syntax
Syntax provides constraints that can be used in extracting the meaning
of a sentence. At this point in the course, syntax appears to tell us
only that specific groups of words constitute various components of the
sentence. It may not seem like this tells us much about meaning.
Later on we'll see that there are techniques for using this
information to determine what action or event is described by the sentence,
who or what caused the action, who or what was affected by the action,
what the attributes of the various actors, actions, and objects are,
and so on. There is not, however, a one-to-one
mapping between the syntactic components and the semantic components.
For example, in the sentence

    I saw the man on the hill with a telescope.

it is easy to pick out the two prepositional phrases, but it is unclear
whether the man on the hill had a telescope or a telescope was
used to see him. Thus, while structure constrains meaning, it does not
determine meaning by itself. This is just one example of the quality
that makes natural language understanding by computer a difficult problem:
that quality called AMBIGUITY.
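
To get a feel for how quickly ambiguity piles up, here is a rough sketch
that counts the distinct parse trees for the example sentence. Everything
in it is an assumption made for illustration, not something from class or
the book: the toy grammar in RULES, the one-category-per-word LEXICON, and
the table-filling counting trick in count_parses are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy grammar (an assumption for illustration only).
# Each rule is (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "VERB", "NP"),
    ("VP", "VP", "PP"),    # a PP can modify the action
    ("NP", "NP", "PP"),    # ...or it can modify a noun phrase
    ("NP", "ART", "NOUN"),
    ("PP", "PREP", "NP"),
]

# Hypothetical lexicon: one category per word, to keep things simple.
LEXICON = {
    "i": "NP", "saw": "VERB", "the": "ART", "a": "ART",
    "man": "NOUN", "hill": "NOUN", "telescope": "NOUN",
    "on": "PREP", "with": "PREP",
}


def count_parses(sentence):
    """Count the distinct S trees that cover the whole sentence."""
    words = sentence.lower().rstrip(".").split()
    n = len(words)
    # chart[i][j][X] = number of trees of category X spanning words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1][LEXICON[word]] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n]["S"]


print(count_parses("I saw the man."))
print(count_parses("I saw the man with a telescope."))
print(count_parses("I saw the man on the hill with a telescope."))
```

Under this toy grammar the three calls print 1, 2, and 5. The count for
the full sentence is higher than the two readings discussed above because
"on the hill" can also attach in more than one place.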

Parsing...
...(or syntactic analysis) is the process of decomposing a sentence
into its components, and verifying that the syntactic structure is correct.
Parsing needs two things:

  1. A grammar or other formal specification of allowable structures
     in the language (i.e., the structural rules of the language)
  2. A parsing technique or procedural method for analyzing the
     sentence (i.e., a means of applying or using the rules mentioned above)

What's a grammar?
A grammar contains the knowledge about "legal" syntactic structures,
represented as rewrite rules. The grammar defines the language.
Here's a very simple example:
                                 pop
                                /
       NP            VP        /
S0 ---------> S1 ---------> S2
(This notation is a little bit different from what we used in class, because
it's just too darn hard to draw circles in ASCII. For homework and exams,
use the notation in the book or what we use in class. Don't use this
notation.)
The remainder of the rules would look like this as TNs:
                                    pop
                                   /
       VERB            NP         /
VP0 ---------> VP1 ---------> VP2

                                    pop
                                   /
       ART            NOUN        /
NP0 ---------> NP1 ---------> NP2

                     pop
                    /
       NAME        /
NP3 ---------> NP4

                                    pop
                                   /
       POSS           NOUN        /
NP5 ---------> NP6 ---------> NP7
We can consolidate those noun phrase TNs into a single TN that might
be a little bit easier to deal with:
          ART           NOUN
       ---------> NP1 ---------
      /           ^            \
     /           /              \     pop
    /           /                \   /
   /    POSS   /                 _\|/
NP0 -----------                     NP2
   \                                 ^
    \                               /
     \                             /
      \          NAME             /
       ---------------------------
(OK, so it's ugly here, but on paper or on the whiteboard, it looks a
lot better.)
So now our TN looks like this:
                                 pop
                                /
       NP            VP        /
S0 ---------> S1 ---------> S2

                                    pop
                                   /
       VERB            NP         /
VP0 ---------> VP1 ---------> VP2

          ART           NOUN
       ---------> NP1 ---------
      /           ^            \
     /           /              \     pop
    /           /                \   /
   /    POSS   /                 _\|/
NP0 -----------                     NP2
   \                                 ^
    \                               /
     \                             /
      \          NAME             /
       ---------------------------
This gives us a nice modular set of transition nets. We start with the
S transition net at state S0, and test the input to see if we have a noun
phrase. But to do that test, we need to jump to the NP net at state NP0,
and then test for the various possibilities. If we find that we recognize
a noun phrase (i.e., we've made it all the way to state NP2), then we jump
back to the S net at state S1 and continue from there. In order to do this
jumping and returning, we need to store return points on a stack, so you
can just pretend that there's an implicit stack hanging around somewhere
that allows you to do this. (A finite state machine with a stack is called
a pushdown automaton, or a PDA.)
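
If it helps to see the jumping and returning spelled out, here is a
minimal sketch of a recognizer for the three nets above with the stack
made explicit. The table layout (NETS, START), the function name
recognize, and the choice to feed it word categories instead of actual
words are all assumptions made for this illustration, not notation from
class or the book. It also tries alternatives by keeping an agenda of
configurations, which is one of several reasonable ways to do it.

```python
# Each net maps a state to its outgoing arcs, written as (label, next_state).
# A label is a word category, the name of another net, or "pop".
NETS = {
    "S": {
        "S0": [("NP", "S1")],
        "S1": [("VP", "S2")],
        "S2": [("pop", None)],
    },
    "VP": {
        "VP0": [("VERB", "VP1")],
        "VP1": [("NP", "VP2")],
        "VP2": [("pop", None)],
    },
    "NP": {
        "NP0": [("ART", "NP1"), ("POSS", "NP1"), ("NAME", "NP2")],
        "NP1": [("NOUN", "NP2")],
        "NP2": [("pop", None)],
    },
}
START = {"S": "S0", "VP": "VP0", "NP": "NP0"}


def recognize(categories, start_net="S"):
    """Return True if the sequence of word categories is accepted."""
    # A configuration is (net, state, input position, stack of return points).
    agenda = [(start_net, START[start_net], 0, ())]
    while agenda:
        net, state, pos, stack = agenda.pop()
        for label, target in NETS[net][state]:
            if label == "pop":
                if stack:
                    # Jump back to the return point saved on the stack.
                    ret_net, ret_state = stack[-1]
                    agenda.append((ret_net, ret_state, pos, stack[:-1]))
                elif pos == len(categories):
                    return True  # popped out of the top net with no input left
            elif label in NETS:
                # The arc names another net: save where to resume, then jump.
                agenda.append((label, START[label], pos, stack + ((net, target),)))
            elif pos < len(categories) and categories[pos] == label:
                # The arc names a word category: consume one word.
                agenda.append((net, target, pos + 1, stack))
    return False


print(recognize(["ART", "NOUN", "VERB", "NAME"]))
print(recognize(["ART", "NOUN", "VERB"]))
```

The first call prints True and the second prints False, since the VP net
never finds its noun phrase. Note that this sketch would loop forever on
a net that jumps to itself before consuming any input; none of the nets
here do that.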
We could avoid the use of a stack by making one big transition net:
        ART       NOUN                           ART       NOUN
      ------> S1 -------                       ------> S4 -------
     /        ^         \                     /        ^         \
    /        /           \                   /        /           \      pop
   /        /             \                 /        /             \    /
  /  POSS  /              _\|    VERB      /  POSS  /              _\| /
S0 --------                  S2 -------> S3 --------                  S5
  \                           ^            \                           ^
   \                         /              \                         /
    \                       /                \                       /
     \       NAME          /                  \       NAME          /
      ---------------------                    ---------------------
But now we've duplicated the NP net, and that's completely undesirable
for the same reason that duplicating chunks of code in real live
programs is undesirable. So take advantage of the ability to organize
your nets in cohesive modules and rely on the stack to allow you to jump
from module to module, and you'll be doing fine. (In other words, the
net immediately above sucks. The group of three nets for S, NP, and VP
is much better.)
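
To make the duplication concrete, here is a minimal sketch of the one big
net written out as a plain transition table with no stack. The names FLAT,
FINAL, and accepts are assumptions made for this illustration, and the
input is again a list of word categories instead of actual words.

```python
# The big net as a flat table: state -> {word category: next state}.
# Notice that the arcs out of S3 and S4 are a copy of those out of S0 and S1.
FLAT = {
    "S0": {"ART": "S1", "POSS": "S1", "NAME": "S2"},
    "S1": {"NOUN": "S2"},
    "S2": {"VERB": "S3"},
    "S3": {"ART": "S4", "POSS": "S4", "NAME": "S5"},
    "S4": {"NOUN": "S5"},
}
FINAL = {"S5"}  # the only state with a pop arc


def accepts(categories):
    """Walk the flat net; no stack is needed because there are no jumps."""
    state = "S0"
    for category in categories:
        state = FLAT.get(state, {}).get(category)
        if state is None:
            return False  # no arc for this category from the current state
    return state in FINAL


print(accepts(["ART", "NOUN", "VERB", "NAME"]))
print(accepts(["ART", "NOUN", "VERB"]))
```

The first call prints True and the second prints False. Any change to what
counts as a noun phrase now has to be made twice in this table, once for
the subject and once for the object, which is exactly the maintenance
problem described above.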
What happened to these rules?:
NAME