Grammars and Parsing

This section describes several types of formal grammars for natural language
processing, parse trees, and a number of parsing methods, with
a bottom-up chart parser covered in some detail.
We show the use of Prolog for syntactic parsing of natural language
text. Other issues in parsing, including PP attachment, are briefly
discussed.

Predictive Parsing Example 2

Bottom-up parsing

The dogs barked

→ DET N V

→ NP V

→ NP VP

→ S
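The reduction sequence above can be reproduced with a very small sketch. The rule table, the lexicon, and the function names below are assumptions made for this illustration; the greedy "apply the first rule that matches" strategy happens to work for this sentence but is not a general parsing method.

```python
# Toy grammar assumed for this illustration: right-hand side -> left-hand side.
RULES = {
    ("DET", "N"): "NP",
    ("V",): "VP",
    ("NP", "VP"): "S",
}
# Toy lexicon assumed for this illustration: one category per word.
LEXICON = {"the": "DET", "dogs": "N", "barked": "V"}


def reduce_once(symbols):
    """Apply the first matching rule, trying two-symbol matches before
    one-symbol matches and scanning left to right. Return the new symbol
    list, or None if no rule applies."""
    for width in (2, 1):
        for i in range(len(symbols) - width + 1):
            lhs = RULES.get(tuple(symbols[i:i + width]))
            if lhs is not None:
                return symbols[:i] + [lhs] + symbols[i + width:]
    return None


def bottom_up(words):
    """Return every intermediate symbol list, from lexical categories upward."""
    symbols = [LEXICON[w.lower()] for w in words]
    steps = [symbols]
    while True:
        reduced = reduce_once(symbols)
        if reduced is None:
            return steps
        symbols = reduced
        steps.append(symbols)


for step in bottom_up("The dogs barked".split()):
    print("→", " ".join(step))
```

Each printed line corresponds to one line of the derivation shown above, ending in S.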

Parsing Efficiency

Using bottom-up parsing methods, all CFGs can be
parsed in n³ steps, where n is the length of
the sentence, whereas predictive parsing can take exponential time
(i.e. much slower).
The reason predictive parsing may
take exponential time is that it may re-parse
the same pieces of the sentence many times, particularly in confusing
sentences (like the horse [that was] raced
past the barn fell).

Chart Parsing

The chart is a record of all the substructures (like past the barn)
that have ever been built during the parse.

A chart is sometimes also
called a well-formed substring table.

Chart Parsing Advantages

Charts help with "elliptical" sentences:

1: Q. How much are apples?
2: A. Thirty cents each.
3: Q. Plums?

An attempt to parse 3 as a sentence fails, but all is not lost, as the analysis
of plums as an NP is on the chart.

Successful parsing of the entire
utterance as any kind of structure can be useful.
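As a rough sketch of how a chart supports this fallback, suppose the chart is stored as a list of (category, FROM, TO) triples. That storage format, the function name, and the sample chart contents are assumptions for this illustration only.

```python
def spanning_analyses(chart, length, goal="S"):
    """Return the constituents that cover the whole utterance.

    If any of them has the goal category, return only those; otherwise
    return every spanning constituent, whatever its category."""
    spanning = [c for c in chart if c[1] == 0 and c[2] == length]
    goals = [c for c in spanning if c[0] == goal]
    return goals or spanning


# Hypothetical chart after trying to parse the one-word utterance "Plums?"
plums_chart = [("N", 0, 1), ("NP", 0, 1)]
print(spanning_analyses(plums_chart, 1))

# Hypothetical chart for a four-word sentence that does parse as an S
sentence_chart = [("NP", 0, 3), ("VP", 3, 4), ("S", 0, 4)]
print(spanning_analyses(sentence_chart, 4))
```

No S is found for the first chart, but the NP analysis of plums is still available to later processing.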

A Bottom-Up Chart-Based Parsing Algorithm

The algorithm parses a sentence of length
N within N³ steps.

It does better than this
(N² or N steps) with well-behaved
grammars.

It constructs the phrasal and
lexical constituents of a sentence.

We use the sentence the green fly
flies as an example.

Annotate the sentence with positions:
0 the 1 green 2 fly 3 flies 4

Chart Parser 2

The parsing process succeeds if an S (sentence)
constituent is found covering positions 0 to 4.

The operations described below do not completely specify the order in which parsing
steps are carried out. One reasonable order is:

scan a word, creating its lexical constituents;

perform all possible parsing steps (creating active arcs, advancing the "•",
and converting completed arcs to constituents) before scanning another word.

Parsing is completed when the last word has been read
and all possible subsequent parsing steps have
been performed.

Chart Parser 3

Parser inputs: sentence, lexicon, grammar.

Parser operations:

The algorithm operates on two data structures:
the active chart, which is a collection of active arcs,
and the constituents. Both are described below,
and both are initially empty.

The grammar is considered to include lexical
insertion rules: for example, if fly is a
word in the lexicon/vocabulary being used, and if its
lexical entry includes the fact that fly may
be a N or a V, then rules of the form N →
fly and V → fly
are considered to be part of the grammar.
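One way to picture this is to derive the lexical insertion rules mechanically from the lexicon. The dictionary layout and the (category, right-hand side) rule format are assumptions for this sketch, not a representation prescribed by the text.

```python
# Hypothetical lexicon: each word maps to the categories it may have.
LEXICON = {
    "the": ["DET"],
    "green": ["ADJ"],
    "fly": ["N", "V"],
    "flies": ["V"],
}


def lexical_rules(lexicon):
    """Turn every (word, category) pair into a rule: category -> word."""
    return [(cat, (word,)) for word, cats in lexicon.items() for cat in cats]


for lhs, rhs in lexical_rules(LEXICON):
    print(lhs, "→", " ".join(rhs))
```

Because fly is listed as both N and V, the output includes both N → fly and V → fly.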

Chart Parser 4

As a word (like fly) is scanned,
constituents corresponding to its lexical categories
are created:

N1: N → fly FROM 2 TO 3, and

V1: V → fly FROM 2 TO 3

If the grammar contains a rule like
NP → DET ADJ N, and a constituent like
DET1: DET → the FROM m TO n
has been found, then an active arc

ARC1: NP → DET1 • ADJ N FROM m TO n

is added to the active chart. (In our
example sentence, m would be 0 and n would be 1.)
The "•" in an active arc marks the boundary between
found constituents and constituents not (yet) found.
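The following sketch shows one possible representation of constituents and active arcs, and how a newly found constituent starts an arc. The class names, field names, and the start_arcs helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constituent:
    name: str    # e.g. "DET1"
    cat: str     # e.g. "DET"
    start: int   # FROM position
    end: int     # TO position


@dataclass(frozen=True)
class ActiveArc:
    lhs: str       # category being built, e.g. "NP"
    found: tuple   # names of constituents to the left of the dot
    needed: tuple  # categories still required, to the right of the dot
    start: int
    end: int

    def __str__(self):
        return (f"{self.lhs} → {' '.join(self.found)} • "
                f"{' '.join(self.needed)} FROM {self.start} TO {self.end}")


def start_arcs(constituent, grammar):
    """Create an arc for every rule whose right-hand side begins with
    the category of the given constituent."""
    return [
        ActiveArc(lhs, (constituent.name,), tuple(rhs[1:]),
                  constituent.start, constituent.end)
        for lhs, rhs in grammar
        if rhs[0] == constituent.cat
    ]


GRAMMAR = [("NP", ("DET", "ADJ", "N")), ("VP", ("V",)), ("S", ("NP", "VP"))]
det1 = Constituent("DET1", "DET", 0, 1)
arcs = start_arcs(det1, GRAMMAR)
print(arcs[0])
```

With the example sentence, the single arc produced corresponds to ARC1 with m = 0 and n = 1.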

Chart Parser 5

Advancing the "•": If the active chart has an active arc like:

ARC1: NP → DET1 • ADJ N FROM m TO n

and there is a constituent in the chart of type ADJ (i.e. the first
item after the •), say

ADJ1: ADJ → green FROM n TO p

such that the FROM position in the constituent matches the
TO position in the active arc, then the "•" can be advanced,
creating a new active arc:

ARC2: NP → DET1 ADJ1 • N FROM m TO p.


Chart Parser 6

If the process of advancing the "•" creates an active arc whose "•"
is at the far right-hand side of the rule, e.g.

ARC3: NP → DET1 ADJ1 N1 • FROM 0 TO 3

then this arc is converted to a constituent:

NP1: NP → DET1 ADJ1 N1 FROM 0 TO 3.

Not all active arcs are ever completed in this sense.
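Advancing the "•" and converting a completed arc can be sketched as two small functions. The tuple layouts and the names advance and complete are assumptions for this illustration.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    lhs: str
    found: tuple    # constituent names to the left of the dot
    needed: tuple   # categories to the right of the dot
    start: int
    end: int


class Constituent(NamedTuple):
    name: str
    cat: str
    children: tuple
    start: int
    end: int


def advance(arc, constituent):
    """Move the dot over the constituent if it has the next needed category
    and its FROM matches the arc's TO; otherwise return None."""
    if (arc.needed and arc.needed[0] == constituent.cat
            and arc.end == constituent.start):
        return Arc(arc.lhs, arc.found + (constituent.name,),
                   arc.needed[1:], arc.start, constituent.end)
    return None


def complete(arc, name):
    """Convert an arc whose dot is at the far right into a constituent;
    return None if the arc still needs something."""
    if arc.needed:
        return None
    return Constituent(name, arc.lhs, arc.found, arc.start, arc.end)


arc1 = Arc("NP", ("DET1",), ("ADJ", "N"), 0, 1)
adj1 = Constituent("ADJ1", "ADJ", ("green",), 1, 2)
n1 = Constituent("N1", "N", ("fly",), 2, 3)

arc2 = advance(arc1, adj1)   # NP → DET1 ADJ1 • N FROM 0 TO 2
arc3 = advance(arc2, n1)     # NP → DET1 ADJ1 N1 • FROM 0 TO 3
np1 = complete(arc3, "NP1")
print(np1)
```

Note that complete(arc2, ...) returns None: arc2 still needs an N, so it stays an active arc.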

Chart Parser 7

Both lexical and phrasal constituents can be used to create active arcs and
to advance the "•": e.g. if the grammar contains a rule S → NP VP, then as soon
as the constituent NP1 described above is created, it will be
possible to make a new active arc

ARC4: S → NP1 • VP FROM 0 TO 3

Chart Parser 8

When subsequent constituents are created, they would have names like NP2,
NP3, ..., ADJ2, ADJ3, ... and so on.

The aim of parsing is to get phrasal constituents (normally of type S)
whose FROM is 0 and whose TO is the length of the sentence. There may be
several such constituents.

The example above assumes a minimal grammar and lexicon:
if "green" could be a N (noun),
"fly" a V (verb), and/or
"flies" a N (noun), then there would be more
lexical constituents, for example. The grammar rules
used above were NP → DET ADJ N; VP → V; and S → NP VP.
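Putting the pieces together, here is a minimal sketch of the whole procedure for the example sentence, using the three grammar rules just listed and a one-category-per-word lexicon. The data layout, the agenda, and all function names are assumptions. As a simplification, an arc whose "•" reaches the far right is turned into a constituent immediately rather than being stored on the active chart first.

```python
from collections import Counter

GRAMMAR = [("NP", ("DET", "ADJ", "N")), ("VP", ("V",)), ("S", ("NP", "VP"))]
LEXICON = {"the": ["DET"], "green": ["ADJ"], "fly": ["N"], "flies": ["V"]}


def parse(words, grammar=GRAMMAR, lexicon=LEXICON):
    constituents = []   # (name, category, children, FROM, TO)
    arcs = []           # active arcs: (lhs, found, needed, FROM, TO)
    agenda = []         # constituents not yet used to build or advance arcs
    counts = Counter()  # per-category counters for names like NP1, NP2

    def add_constituent(cat, children, start, end):
        counts[cat] += 1
        new = (f"{cat}{counts[cat]}", cat, children, start, end)
        constituents.append(new)
        agenda.append(new)

    def add_arc(lhs, found, needed, start, end):
        if needed:                       # dot not yet at the far right
            arcs.append((lhs, found, needed, start, end))
        else:                            # completed: becomes a constituent
            add_constituent(lhs, found, start, end)

    for pos, word in enumerate(words):
        # Scan a word: one lexical constituent per category.
        for cat in lexicon[word]:
            add_constituent(cat, (word,), pos, pos + 1)
        # Perform all possible steps before scanning the next word.
        while agenda:
            name, cat, _, start, end = agenda.pop(0)
            # Start arcs for rules whose right-hand side begins with cat.
            for lhs, rhs in grammar:
                if rhs[0] == cat:
                    add_arc(lhs, (name,), rhs[1:], start, end)
            # Advance arcs whose TO matches this constituent's FROM.
            for lhs, found, needed, a_start, a_end in list(arcs):
                if needed[0] == cat and a_end == start:
                    add_arc(lhs, found + (name,), needed[1:], a_start, end)
    return constituents, arcs


words = "the green fly flies".split()
constituents, arcs = parse(words)
for name, cat, children, start, end in constituents:
    print(f"{name}: {cat} → {' '.join(children)} FROM {start} TO {end}")
```

The parse succeeds because the final constituent is an S from 0 to 4. Three arcs remain active at the end, which illustrates that not every arc is completed. A word missing from the lexicon raises a KeyError in this sketch.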

Notes on Chart Parsing

disadvantage of bottom-up method: will find irrelevant constituents
like the VP hold the water which would not be noticed by a top-down
parser, because it wouldn't be looking for (the start of) a VP at that point;

a top-down CFG parser can have a chart (but you have to keep track
of which constituents are only hypothesized, and which have actually
been substantiated by the text being parsed);

mixed-mode parsers have the best aspects of both methods (disadvantage: more
complicated)

Recording Sentence Structure

A frame-like, or slot/filler, representation works, for example for the sentence John fed the numbat.
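One way such a slot/filler structure might look is a nested dictionary. The slot names used here (type, subject, verb, object, and so on) are hypothetical choices for this sketch; the original slides may use different ones.

```python
# Hypothetical frame for "John fed the numbat"; slot names are assumptions.
sentence_frame = {
    "type": "S",
    "subject": {"type": "NP", "name": "John"},
    "verb": {"type": "V", "root": "feed", "tense": "past"},
    "object": {"type": "NP", "det": "the", "head": "numbat"},
}


def get_slot(frame, *path):
    """Follow a sequence of slot names down through nested frames."""
    value = frame
    for slot in path:
        value = value[slot]
    return value


print(get_slot(sentence_frame, "object", "head"))
print(get_slot(sentence_frame, "verb", "tense"))
```

Named slots let later processing ask for the subject or the object directly instead of walking a parse tree.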

Limitations of Syntax in NLP

it is reasonable to ask for syntactically correct programs, but
unrealistic to ask for syntactically correct NL. Written NL material is
sometimes correct, but spoken utterances are rarely grammatical. NL
systems must be syntactically and semantically robust.

some approaches have sought to be semantics-driven, to avoid the
problem of how to deal with syntactically ill-formed text. However, some
syntax is essential - else how do we distinguish between Cyril loves
Audrey and Audrey loves Cyril?

Summary: Grammars and Parsing

There are many approaches to parsing and many grammatical
formalisms. Some problems in deciding the structure of a sentence
turn out to be undecidable at the syntactic level. We have
concentrated on a bottom-up chart parser based on a context-free
grammar. We will subsequently extend this parser to augmented
grammars.