Prospects for in-depth story understanding by computer

by Erik T. Mueller

November 29, 1999

Abstract
While much research on the hard problem of in-depth story
understanding by computer was performed starting in the 1970s,
interest shifted in the 1990s to information extraction and word sense
disambiguation. Now that a degree of success has been achieved on
these easier problems, I propose it is time to return to in-depth
story understanding. In this paper I examine the shift away from story
understanding, discuss some of the major problems in building
a story understanding system, present some possible solutions
involving a set of interacting understanding agents, and
provide pointers to useful tools and resources for building story
understanding systems.

Introduction

In 1976 John McCarthy wrote a memo discussing the problem of getting
a computer to understand the following story from the New York
Times:

A 61-year old furniture salesman was pushed down the shaft of a freight
elevator yesterday in his downtown Brooklyn store by two robbers while
a third attempted to crush him with the elevator car because they were
dissatisfied with the $1,200 they had forced him to give them.

The buffer springs at the bottom of the shaft prevented the car from
crushing the salesman, John J. Hug, after he was pushed from the first
floor to the basement. The car stopped about 12 inches above him as he
flattened himself at the bottom of the pit.

...
(McCarthy, 1990, p. 70)

In 1999 the problems raised by McCarthy remain unsolved. No program
can understand the above story at a level anywhere near that of a
human reader. Only a limited understanding is currently possible.

Information extraction technology (Chinchor, 1999),
for instance, can be used to read the above story and fill a template
with information such as the following (a simplified illustration):
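
Incident:     Robbery
Location:     Furniture store, downtown Brooklyn
Perpetrators: Three robbers
Victim:       John J. Hug, a 61-year-old furniture salesman
Amount taken: $1,200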

Given a particular class of events such as terrorist incidents, programs
can be built that extract basic information such as the type of incident,
perpetrators, and targets. Information extraction technology was
refined during a series of DARPA-supported Message Understanding
Conferences (MUCs) beginning in 1987 (Grishman & Sundheim, 1996).

Word sense disambiguation technology (Ide & Véronis, 1998)
can be used to assign a sense (meaning) to each ambiguous word in a text.
Various word sense disambiguation methods have been developed that can read
the above story and determine that:

elevator refers to a lift in a building rather
than the control surface on the tail of an airplane, and

store refers to a shop rather than a storehouse or
memory device.

Given the heavy ambiguity of natural language, word sense
disambiguation is essential to understanding. The first conference for
evaluating word sense disambiguation systems was held
in 1998 (Kilgarriff, Palmer, Rosenzweig, & Tiberius, 1998).

But the type of understanding McCarthy was concerned with is much
deeper than information extraction and word sense disambiguation. He
proposed that a computer should be able to demonstrate its
understanding by answering questions about the above story such as:

Who was in the store when the events began?
Who was in the store during the attempt to kill Mr. Hug?
Who had the money at the end?
What would have happened if Mr. Hug had not flattened himself at the bottom
of the pit?
Did Mr. Hug want to be crushed?
Did the robbers tell Mr. Hug their names?
Did Mr. Hug like the robbers, and did they like him?
What would have happened if Mr. Hug had tried to run away?
What can Mr. Hug do to avoid this in the future?
Did Mr. Hug know he was going to be robbed?
Does he know that he was robbed?
How did the robber try to crush him with the car?
How long did the events take?
What crimes were committed?
...
(excerpted from McCarthy, 1990, pp. 70-71)

These questions are beyond the state of the art.

In the 1970s and 1980s a number of researchers worked on the hard
problem McCarthy discusses, which I refer to as in-depth story
understanding. What happened to this research?

Charniak (1972) built a program for understanding simple children's
stories and continued to work on story understanding well into the
1980s, developing marker passing techniques for explaining the
motivations of characters. But he then shifted his focus to
statistical techniques for syntactic parsing and word sense
disambiguation (Charniak, 1993), apparently having run out of steam on
story understanding. In 1986 he wrote:

The problem with marker passing is that it is not obvious if it can do
the job of finding important inferences in a very large and
interconnected database. ... Indeed, [the program] finds a lot of
garbage.

Beginning in the 1970s, Roger Schank and his academic descendants
(see Belew, 1999) built
a number of programs for understanding and
answering questions about simple stories (Lehnert, 1978; Schank &
Riesbeck, 1981; Dyer, 1983) by tracking typical situations (scripts)
and the plans and goals of story characters (Schank & Abelson, 1977).
But by the early 1990s, many Schank descendants had drifted away from the
story understanding problem. Lehnert (1994) provides an interesting
account of her own evolution toward information extraction:

SAM [Script Applier Mechanism] at this time [1975] only knew about one
script (whereas people have hundreds or thousands of them). ... In
truth, SAM was carefully engineered to handle the input it was
designed to handle and produce the output it was designed to produce. ...

SAM was just a prototype. As such, it didn't have to be robust or
bullet proof. ...

As a graduate student, I was repeatedly reassured that it was
necessary to walk before one could hope to run. Robust systems are
nothing to worry over when you're still trying to master the business
of putting one foot in front of the other. When I became a professor
I said the same thing to my own students, but with a growing unease
about the fact that none of these robustness issues had budged at all
in 5 or 10 or 15 years. ...

By the time we got into the 90s, portability and scalability had
become legitimate issues for basic research. In particular, our
government sponsors were increasingly preoccupied with the question of
whether or not certain technologies would scale up. It was within this
atmosphere that the series of Message Understanding Conferences were
first conceived. ...

I'd say that the UMass/MUC-4 system understands terrorism about as
well as SAM understood restaurants. ... On the other hand, UMass/MUC-4
runs rings around SAM when it comes to robustness and generality. ...

My guess is that researchers abandoned story understanding for a
variety of reasons. Some decided they would rather succeed on an easy
problem than fail on a hard problem. Some got tired of the topic or ran
out of ideas. Some did not spend enough time coding and debugging and thus
were never able to realize their ideas. Some became distracted by
easier subproblems. Some wished to develop immediate business
applications. Funding pressures and changing fashions played a large
role. Whereas in 1983, Dyer could write that "a simple metric, such as
'this program correctly reads all stories 100 words or less in length'
is unattainable" (p. 18), by 1990 measurement in natural language
processing was all the rage. Though performance measurement is
certainly useful for evaluating and comparing systems, it can also
skew research. As Shieber (1994) puts it:

Any behavioral test that is sufficiently constrained for our current
technology must so limit the task and domain as to render the test
scientifically uninteresting. ... The trend among [MUC] entrants over
the last several conferences has been toward less and less
sophisticated natural-language-processing techniques, concentrating
instead on engineering tricks oriented to the exigencies of the
restricted task—keyword-spotting, template-matching, and similar
methods. In short, this is because such limited tests are better
addressed in the near term by engineering (building bigger springs)
than science (discovering the airfoil).

With some exceptions (see, for example, Domeshek, Jones, & Ram, 1999), little
research on in-depth story understanding has been performed in the last decade.
Work has instead concentrated on the easier tasks of information
extraction and word sense disambiguation. (I have even built an
information extraction system [Mueller, 2000b].)
Now that research on these tasks has achieved some success, and indeed
may be reaching a point of diminishing returns (for example Ide & Véronis
[1998] write about word sense disambiguation that "we may have nearly reached
the limit of what can be achieved in the current framework" [p. 27]), I
propose it is time to return to the harder problem of story
understanding.

In the remainder of this paper, I present one approach for building
an in-depth story understanding system.
First, I discuss some of the major problems
that arise in attempting to build a story understanding system, and
some possible solutions. Second, I provide pointers to tools and
resources that can be used when building a story understanding system,
including part-of-speech taggers, syntactic parsers, lexicons, and
commonsense knowledge bases. These tools were not widely available
when the story understanding problem was first attacked in the 1970s.
My focus in this paper is on understanding the English language.

The hard understanding problem

How does a human understand language and how do we get a computer to
understand language? Of course these are separate problems. My primary
concern is with getting a computer to understand language using
whatever techniques we can invent. Yet we know humans can understand
language, so even an inkling of how humans do this can be helpful for
building understanding computers.

In this section I present more problems than solutions, but that is
the challenge of building a story understanding system.

Managing complexity

To get a computer to understand stories, we have a large problem on
our hands. By understanding I do not mean simply generating parse
trees, disambiguating words, or filling templates, but being able
to answer arbitrary questions, generate paraphrases and summaries,
fill arbitrary templates, make inferences, reason about the story,
follow reasoning in the story, relate the story to general knowledge,
hypothesize alternative versions of the story, look back over the
story, and more. Thus our first problem is:

How, as human programmers, do we manage the
complexity of building a language understanding system?

Hobbs, Stickel, Appelt, and Martin (1993) claimed that "issues of
modularity simply go away" (p. 115) in their abduction-based story
understanding system in which all linguistic and world knowledge is
uniformly represented as predicate calculus axioms. But this approach
suffers from combinatorial explosion problems (Domeshek, Jones, & Ram,
1999, pp. 93-94).

Minsky (1986, pp. 291-299) has hypothesized that the mind
understands language simultaneously in multiple realms of
thought. Each realm is concerned with a different aspect of the world,
such as the physical, personal, or mental. He proposes that "the
agencies within each realm can simultaneously apply their own methods
for dealing with the corresponding aspect of the common topic of
concern" (p. 294).

A growing body of research (Hirschfeld & Gelman, 1994) supports the
view that much of human cognition is domain-specific. That is, humans
have distinct explanatory frames, skills, and mechanisms for areas
such as numbers, substances, middle-sized objects, physical processes,
living kinds, artifacts, kinship categories, mental states, and moral
beliefs.

This suggests a strategy: Take one area and build an understanding
agent (UA) that is an expert in understanding that area. Then build a
UA that "smashes" the next area (solves it well), and so on, for many areas.

The mind may not divide up the world in the same way as our UAs.
But the division into UAs makes it easier for the mind of the programmer
to cope with the complexity of the task and build the system.

What areas do we need UAs for? Let's consider what realms are
involved for some of the questions McCarthy listed:

Who had the money at the end?
The realm of coercion (since the story stated that the robbers
forced Mr. Hug to give them money) and the realm of
possession (give).

What would have happened if Mr. Hug had not flattened himself at
the bottom of the pit?
The physical realm.

Did Mr. Hug want to be crushed?
The realm of human needs and goals (pain avoidance and survival).

Did the robbers tell Mr. Hug their names?
The realm of communication (tell) and the legal and criminal
realms (that robbery is illegal and robbers do not want to get
caught).

Did Mr. Hug like the robbers, and did they like him?
The emotional realm.

What would have happened if Mr. Hug had tried to run away?
The emotional and criminal realms (that robbers are often armed and
sometimes harm or kill).

Did Mr. Hug know he was going to be robbed? Does he know that he
was robbed?
The mental realm.

How did the robber try to crush him with the car?
The physical realm and the device realm (that buttons or levers control an
elevator).

In the above cases it is fairly clear what realms are
involved. However, deciding what realm a concept falls into, and what
UA should handle it, is not always easy:

Should "Thea peeked around the tree" be handled by the physical UA
or the eyesight UA?

Does "How long did the events take?" involve the
"temporal realm"?
In order to make an overall judgment about duration, an understanding
of the story in many realms is needed: the physical realm (how long it
takes to walk from one part of the room to another), the device realm
(how long it takes an elevator to descend), the criminal realm (how
long a robbery takes), the personal realm (when someone comes to
help), and so forth.

Is "Who was in the store when the events began?" in the
physical realm? What about the "salesman realm"? (A salesperson works
in a store. Sometimes there are customers in the store and sometimes
there are not.) The "robber realm"? (Robbers prefer to strike when
there are fewer people present. People might attempt to prevent the
robbery or notify the police.)

Thus our next problem is:

What is the "right" division? How do
we choose the modules of the system, and once they are chosen, how do we decide
where any given piece of functionality should go? Can the understanding
task really be modularized? Will this even work technically?

For example, the physical UA makes the inference that Mr. Hug might be
crushed if the elevator descends. That Mr. Hug makes a similar
inference and knows he might be crushed is an inference made by the
mental UA. That Mr. Hug then has the goal to avoid being crushed is an
inference made by the human need/goal UA. These UAs seem hopelessly
intertwined. It seems futile to break the problem into UAs, since each
UA ends up redoing the work of (and having the same expertise as)
other UAs.

The answer is that UAs are not isolated entities. Rather, UAs depend
on each other since information in one UA may be essential to another
UA's understanding. Further, UAs need to come to some sort of
agreement on an interpretation. We would like to avoid major
inconsistencies where one UA thinks Mr. Hug is stuck in an elevator
shaft while another thinks he is selling furniture to a customer.

Managing interaction

How do UAs interact and how do we manage their interaction? This is
our next problem:

How do the modules of a language
understanding system provide information to one another, make use of this
information, and settle on a unified interpretation?

Minsky (1986, p. 294) has proposed a type of agent called a
paranome that operates simultaneously on agents in several
different realms. In understanding "Mary gives Jack the kite",
an origin paranome forces agents in the physical, possessional, and
psychological realms to consider and agree on the origin of the
transfer of the kite. A destination paranome forces agents to consider
and agree on the destination of the transfer.

What do the UAs need to agree on when processing the newspaper
story? Let's consider some of the alternative hypotheses that
might be made by the UAs. Hypotheses that will eventually be
ruled out are marked with an asterisk.

personal UA for S

S is furniture salesman
S works at furniture store
S is 61 years old

personal UA for R

R is robber

physical UA

*elevator = control surface
elevator = lift
push = physically push

R pushes S to elevator

R pushes S down elevator shaft

S falls

crush = compress

R tries to crush S with elevator

S is between bottom of elevator and ground

Elevator descends

S flattens himself

Elevator hits buffer springs

Elevator stops

give = hand over

S opens cash register

S grabs $1200

S moves hand to R's hand

S releases $1200

R grabs $1200

*forced = physically force

*R grabs $1200 from S

store = shop

Store is in downtown Brooklyn

Elevator is in store

human need/goal UA for S

Elevator descends

S wants to stay alive

S wants to avoid being crushed

Elevator stops

S avoids being crushed

S wants to keep his job

human need/goal UA for an armed robber R

R is armed
R is not afraid to kill
R wants lots of money

R wants to rob furniture store

coercion UA

forced = coerce

R forces S to give R $1200

R threatens S with gun

*push = impel

*R impels S to go down elevator shaft

device UA

*store = memory device
*elevator = control surface
elevator = lift

R pushes buttons on elevator

Elevator descends

*car = motor vehicle
car = part of lift

emotion UA for S

S is fearful of R

emotion UA for R

R is dissatisfied with $1200

R is angry at S

possession UA

give = transfer of possession

S gives R $1200

R has $1200

competition UA

*crush = defeat

Each UA may entertain several alternative hypotheses. For example, the
physical UA considers the details of how the robbers obtained the
money: Perhaps the robber physically forced the money from the
salesman, or perhaps the salesman handed the money to the robber.

Alternatives arise for a number of reasons: Words in the story may
have multiple parts of speech (force may be a noun or verb),
words may have multiple senses (give may mean any transfer of
possession as well as physically handing over and force may
mean physical force as well as coercion), sentences may have
alternative syntactic parse trees, pronouns may have several possible
referents, and so on. Leaving many details unspecified is the norm for
natural language.

The UAs need to coordinate and settle on an overall shared
interpretation. How is this done?

In general there are hard and soft constraints on the shared
interpretation. Constraints may be within a single UA or across UAs.
A hard constraint is that the device and physical UAs must agree on
whether the sense of elevator is control surface or lift. A
soft constraint is that "S is fearful of R" is correlated with
"R tries to crush S with elevator."

To see how constraints operate, consider the simpler sentence "Jim set
the milk on the table." Suppose the UAs generate numbered hypotheses:
word sense hypotheses 1 through 7 (such as set = place versus set =
jell, milk = glass of milk versus milk by itself, and several senses of
table), hypotheses 8 through 13 about Jim's motivations and goals (such
as that Jim wants to drink, eat a meal, or do something strange), and
hypotheses from the physical UA about the location of the table:

physical UA:
14. Table is in kitchen
15. Table is in dining room
16. Table is in office

The hard constraints are:

One sense per word:
xor(1,2)
xor(3,4)
xor(5,6,7)

One goal:
xor(9,12,13)

One location:
xor(14,15,16)

The soft constraints are:

Placing a glass of milk on the dining table agrees with eating a meal:
and(1,3,5,12)

Placing a glass of milk on the dining table or desk agrees with drinking:
and(1,3,or(5,6),9)

Allowing milk to jell on the table agrees with doing something strange:
and(2,4,5,13)

Goals agree with their motivation:
and(8,9)
and(or(10,11),12)

Dining table agrees with kitchen:
and(5,14)

Dining table agrees with dining room:
and(5,15)

Desk agrees with office:
and(6,16)

Reasonable solutions are:

set = place
milk = glass of milk
table = dining table
It is mealtime
Jim wants to eat meal
Table is in dining room

or:

set = jell
milk = milk by itself
table = desk
Jim wants to do something strange
Table is in kitchen

Solutions may be found in a number of ways: UAs can negotiate directly
with one another or through intermediary paranomes. Ferber (1999)
discusses protocols for collaboration in a society of agents. A
weighted maximum satisfiability (MAX-SAT) algorithm (see, for example,
Jiang, Kautz, & Selman, 1995) might be used.
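
As a concrete (if simplistic) rendering of the last option, the
numbered hypotheses can be treated as boolean variables, the hard
constraints as requirements, and the soft constraints as weighted
preferences; the Python sketch below then searches by brute force for
the assignment that satisfies every hard constraint and maximizes the
total weight of satisfied soft constraints. The weights are arbitrary,
and a real system would use a weighted MAX-SAT solver or negotiation
among UAs rather than enumeration.

from itertools import product

N = 16  # hypotheses are numbered 1..16 as in the example above

def exactly_one(a, *hs):
    # Hard constraint: exactly one of the listed hypotheses is accepted.
    return sum(a[h] for h in hs) == 1

def hard_ok(a):
    return (exactly_one(a, 1, 2) and exactly_one(a, 3, 4) and
            exactly_one(a, 5, 6, 7) and        # one sense per word
            exactly_one(a, 9, 12, 13) and      # one goal
            exactly_one(a, 14, 15, 16))        # one location

# Soft constraints as (weight, test) pairs; the weights are illustrative.
SOFT = [
    (2, lambda a: a[1] and a[3] and a[5] and a[12]),           # milk on dining table ~ eating
    (2, lambda a: a[1] and a[3] and (a[5] or a[6]) and a[9]),  # ~ drinking
    (2, lambda a: a[2] and a[4] and a[5] and a[13]),           # jelling milk ~ something strange
    (1, lambda a: a[8] and a[9]),                              # goals agree with their motivation
    (1, lambda a: (a[10] or a[11]) and a[12]),
    (1, lambda a: a[5] and a[14]),                             # dining table ~ kitchen
    (1, lambda a: a[5] and a[15]),                             # dining table ~ dining room
    (1, lambda a: a[6] and a[16]),                             # desk ~ office
]

def score(a):
    return sum(w for w, test in SOFT if test(a))

best, best_score = None, -1
for bits in product([False, True], repeat=N):
    a = dict(zip(range(1, N + 1), bits))
    if hard_ok(a) and score(a) > best_score:
        best, best_score = a, score(a)

print("accepted hypotheses:", sorted(h for h in best if best[h]),
      "score:", best_score)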

Each UA can use whatever representations are best for understanding
its realm. The physical UA might use a combination of spatial
propositions, 2-dimensional grids, and 3-dimensional models. The
emotion UA might use analog vectors. But in order for UAs to
communicate, they must share a language.

If one UA is unsure about two alternatives, it might request the
opinion of a UA that is an expert in the area. The trouble is, the UA
might not even know the right question to ask the expert UA. It might
not even realize the two alternatives exist.

A UA needs to be able to link a part of its representation to a part
of a representation in another UA. That is, just as there can be
causal links within a UA such as:

R is angry at S BECAUSE
R is dissatisfied with $1200
(in the emotion UA for R)

there can be causal links across UAs:

R is dissatisfied with $1200 (in the emotion UA for R) BECAUSE
S gives R $1200 (in the possession UA)

R pushes S to elevator (in the physical UA) BECAUSE
R is angry at S (in the emotion UA for R)

Linked representations are not necessarily in the same language,
though a common language must be used to set up the link.
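
A minimal sketch of such cross-UA links, assuming each proposition is
tagged with the UA that holds it and links are stored as (effect,
cause) pairs; the representation is an illustration, not a proposal
for the shared language itself:

from collections import namedtuple

Proposition = namedtuple("Proposition", ["ua", "content"])

dissatisfied = Proposition("emotion UA for R", "R is dissatisfied with $1200")
angry        = Proposition("emotion UA for R", "R is angry at S")
gives        = Proposition("possession UA",    "S gives R $1200")
pushes       = Proposition("physical UA",      "R pushes S to elevator")

# Causal links as (effect, cause) pairs; the first is within a single UA,
# the other two cross UA boundaries.
causal_links = [
    (angry, dissatisfied),
    (dissatisfied, gives),
    (pushes, angry),
]

def explain(prop, links):
    # Print the chain of causes behind a proposition, following links across UAs.
    for effect, cause in links:
        if effect == prop:
            print(f"{effect.content} ({effect.ua}) BECAUSE "
                  f"{cause.content} ({cause.ua})")
            explain(cause, links)

explain(pushes, causal_links)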

Managing updates

As new inputs are processed, the solution needs to be updated.
This is our next problem:

How does the system update its
interpretation as new inputs are received? How are new inputs related to
previous inputs? How does the system again settle on a unified
interpretation?

Each input provides new information that must be incorporated into the
current understanding, and also provides new constraints on the
interpretation. Norman (1982) describes an experiment that
demonstrates how readers update their hypotheses about what the
story is about after reading each sentence:

Dear little thing.
letter, pet

It was nice to feel it again.
pet, toy, indefinite

She had taken it out of its box that afternoon, given it a good
brush and rubbed life back into the dim little eyes.
pet, toy

Little rogue! Yes, she really felt that way about it.
pet, toy

She put it on.
clothing, fur

Little rogue, biting its tail just by her left ear.
clothing, fur

When she breathed, something gentle seemed to move on her
bosom.
fur

The day was cool and she was glad she had decided on her little
fur.
fur
(adapted from pp. 92-94)

The UAs must negotiate an agreement about how to relate and merge
events in the discourse: For example, the second sentence of the
newspaper story states that Mr. Hug was pushed from the first floor to
the basement. The physical UA must realize that this is the same event
as the one mentioned in the first sentence: that Mr. Hug was pushed
down the shaft of a freight elevator. The physical UA has enough
knowledge of the physical world to make this connection, but other UAs
may not.

Managing possibilities

How long should UAs hold onto possibilities? That is:

How many alternative interpretations does
the language understanding system maintain? When does it drop
possibilities?

The trouble with keeping possibilities is that every possibility leads
to more possibilities and we get a combinatorial explosion. The
trouble with dropping unlikely possibilities is that those
possibilities may later prove correct.

There are two ways the system could recover from a poor choice about
what to drop: The system could fix the current interpretation
on-the-fly, or it could back up to an earlier point in the story and
reread (as humans sometimes do).

Managing detail

The next problem is:

How much detail does the language
understanding system compute while reading the story? How much does it compute
on demand during question answering (and other tasks)?

The more we compute as we read, the faster we can answer questions.
The less we compute as we read, the faster we can read.

For example, some of the questions about the newspaper story can be
answered based on information that was computed by the UAs while reading
the story:

Who had the money at the end?
possession UA: R has $1200

Did Mr. Hug want to be crushed?
human need/goal UA for S: S wants to avoid being crushed

How did the robber try to crush him with the car?
physical UA: S is between bottom of elevator and ground
device UA: R pushes buttons on elevator
device and physical UAs: Elevator descends

How long did the events take?
Durations are calculated by many UAs and these are constrained
to be in rough agreement.

Other questions require further reasoning:

What would have happened if Mr. Hug had not flattened himself
at the bottom of the pit?
The physical UA reasons that Mr. Hug may have been crushed.

What would have happened if Mr. Hug had tried to run away?
The emotional and human need/goal UAs reason about the likely behavior
of the robbers.

Does he know that he was robbed?
A mental state UA infers that he knew.

How much inferencing humans perform during reading is a topic of
considerable debate (McKoon & Ratcliff, 1992), but it seems clear that
we perform varying amounts of processing depending on the situation:
When we glance at a story our understanding is similar to that of an
information extraction system (incident: robbery, location: Brooklyn).
When reading for pleasure we imagine the details of the setting. When
reading a mystery we might engage in an elaborate stream of thought
trying to crack it. How much energy we devote depends on our reasons
for reading the story, our mood, and other factors. An in-depth story
understanding system should have the same flexibility.

We can set a baseline level of understanding for the system by
providing a list of questions that each UA should be able to answer
quickly. (These baseline questions could eventually be used in a
MUC-like evaluation for story understanding systems.) The baseline
questions for the human need/goal UA for a person P are:

Why did P perform an action or cause a state?
How did P react to an action or state?
What did P expect to happen when performing an action or causing a state?
What was P's goal?
Did P's goal succeed or fail?

The physical UA should create physical models (propositional, 2-d, and/or
3-d) for the events in the story and be able to answer the questions:

Where was someone/something?
Where did someone/something go?
Where did someone/something come from?
How did someone/something move?
Who/what was near/in/on/... someone/something?
Where was someone/something in relation to someone/something else?

The baseline questions for the possession UA are:

Who had something before an action?
Who had something after an action?
Who transferred something to someone?

The baseline questions for the competition UA are:

Were the goals of person1 and person2 in concord or conflict?
What were those goals?
What was the outcome?

Other baseline questions that involve multiple UAs are:

What time of day was it?
How long did something take?
Why did an action or state occur?
Who/what performed an action or caused a state?
What is the theme of the story?

Managing processing

A deeper level of understanding is possible if the system asks itself
questions and tries to answer them (Ram, 1994). The answers lead to
further questions and answers. The system considers alternatives and
the consequences of those alternatives. Eventually it decides what it
thinks happened.

A problem that arises is how to control this process. While humans
have an amazing ability to cut straight to the heart of the matter,
computers often get lost. That is:

How does the language understanding system
decide what to consider next? How does it decide what not to
consider? How does it know when it has reached a stumbling
block? How do we trigger and direct processing?

If the system considers alternatives, and those alternatives lead to yet
other alternatives, the system easily gets bogged down. When reading
the newspaper story, it has a number of opportunities to get lost
in fruitless lines of thought:

What was the layout of the store?

Was the elevator in the same room as the cash register?

Maybe there were two rooms.

Maybe there were three rooms.

Maybe there were four rooms.

...

Why did the robbers push him down the elevator shaft?

Were the robbers unarmed?
...

How did Mr. Hug get into the furniture business?

Maybe his family was in the business.
...

Why didn't somebody call the police?

Why was nobody in the store?

Maybe it was the weekend.

Maybe it was 8 pm.

Maybe it was 9 pm.

Maybe it was 10 pm.

...

Why was a small amount of money in the cash register?

Maybe they had just taken cash to the bank.
Maybe the store was not doing well.

Why was the store not doing well?

Maybe their prices were too high.

Maybe they did not have a good selection.

...

Where do the robbers live?

Maybe they live in Brooklyn.
Maybe they live in the Bronx.
Maybe they live in Manhattan.
...

Whether the above streams of thought are in response to questions
generated by the system or questions provided as input, we have the
same problem: how to control processing.

Minsky (1986, p. 59) has proposed dividing the brain into an A-Brain and a
B-Brain. The A-Brain is hooked up to the external world and the
B-Brain watches and advises the A-Brain. He suggested the following
B-Brain rules:

A seems disordered and confused.     Inhibit that activity.
A appears to be repeating itself.    Make A stop. Do something else.
A does something B considers good.   Make A remember this.
A is occupied with too much detail.  Make A take a higher-level view.
A is not being specific enough.      Focus A on lower-level details.

Processing in the language understanding system needs to be carried
out by A-Brain tasks that are subject to control (starting, reducing
priority, stopping) by a B-Brain. In addition to the
domain-independent rules above, the B-Brain can contain rules
particular to the reading task:

If A is spending too much time considering alternative spatial
layouts and spatial layout is not crucial to the story, make A
stop.

If A is spending too much time on a task that does not relate
to the purpose of reading the story, make A stop.

If A is confused, tell A to go back and reread. A may
have ruled out a possibility earlier that needs to be reconsidered.
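
One much-reduced way to picture this is as a table of condition-action
rules that the B-Brain consults whenever the A-Brain reports on its own
state; the condition names and dispatch scheme below are my own
simplification of the rules listed above.

# B-Brain rules mapping observations about the A-Brain to control actions.
B_BRAIN_RULES = {
    "disordered and confused":                          "inhibit that activity",
    "repeating itself":                                 "stop; do something else",
    "did something good":                               "remember this",
    "occupied with too much detail":                    "take a higher-level view",
    "not specific enough":                              "focus on lower-level details",
    # Reading-specific rules:
    "stuck on spatial layout not crucial to story":     "stop",
    "working on task unrelated to purpose of reading":  "stop",
    "confused about the story":                         "go back and reread",
}

def b_brain_advice(observation):
    # Return the control action the B-Brain issues for an observed A-Brain state.
    return B_BRAIN_RULES.get(observation, "let A continue")

print(b_brain_advice("occupied with too much detail"))  # take a higher-level view
print(b_brain_advice("confused about the story"))       # go back and reread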

The basics of understanding

I now provide some pointers to tools and resources to support the UAs
described in the previous section. I review a number of off-the-shelf
programs, emphasizing what is readily available.

Recognizing textual entities

Various textual entities in the input text must be recognized.
Examples in the newspaper article are:

Words: was, pushed

Phrases: freight elevator, buffer springs

Times: yesterday

Places: downtown Brooklyn

Names: John J. Hug

Numbers: $1,200, 12 inches

These entities may be detected using various techniques. Regular
expressions and pattern matching are often used.
Apple Data Detectors (Nardi, Miller, & Wright, 1998) use context-free
grammars. Silberztein (1997) uses finite-state transducers for parsing
spelled out numbers such as four thousand five hundred twenty
three.
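
For illustration, a few regular expressions (my own, not those of any
of the cited systems) that pick out money amounts, measurements, and
candidate person names in the newspaper story:

import re

TEXT = ("A 61-year old furniture salesman was pushed down the shaft of a "
        "freight elevator yesterday in his downtown Brooklyn store by two "
        "robbers because they were dissatisfied with the $1,200 they had "
        "forced him to give them. The car stopped about 12 inches above "
        "the salesman, John J. Hug.")

patterns = {
    "money":       r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?",                  # $1,200
    "measurement": r"\b\d+\s+(?:inches|feet|miles)\b",                   # 12 inches
    "name":        r"\b[A-Z][a-z]+(?:\s+[A-Z]\.)?(?:\s+[A-Z][a-z]+)+\b", # John J. Hug (rough)
}

for label, pattern in patterns.items():
    print(label, re.findall(pattern, TEXT))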

Reynar and Ratnaparkhi (1997) and Palmer and Hearst (1994) provide
methods for recognizing the ends of sentences.

Tagging

Taggers can be used to determine the most likely parts of speech for
each word in a sentence. The de facto standard tagger is Eric Brill's
(1994) freely available rule-based tagger. This tagger's accuracy is
about 97 percent. A best tagger provides a single guess per word
while an n-best tagger provides several guesses per word. Running the
Brill best tagger on the first sentence of the news story yields a
single part-of-speech tag for each word.
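
To make the best versus n-best distinction concrete, here is a toy
sketch; the miniature lexicon and its tag orderings are illustrative,
whereas a real tagger such as Brill's learns its decisions from a
tagged corpus.

# A tiny lexicon listing Penn Treebank tags for each word, most likely first.
LEXICON = {
    "Jim":   ["NNP"],
    "set":   ["VBD", "VBN", "NN"],
    "the":   ["DT"],
    "milk":  ["NN", "VB"],
    "on":    ["IN"],
    "table": ["NN"],
}

def best_tags(words):
    # Best tagger: a single guess per word.
    return [(w, LEXICON[w][0]) for w in words]

def n_best_tags(words, n=2):
    # N-best tagger: several guesses per word.
    return [(w, LEXICON[w][:n]) for w in words]

sentence = "Jim set the milk on the table".split()
print(best_tags(sentence))
print(n_best_tags(sentence))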

Silberztein (1997, pp. 198-201) argues against the use of tagging and
for exhaustive lexical analysis: Even with 97 percent accuracy per
word, the accuracy for a 10-word sentence is only about 73.7 percent
(0.97 raised to the tenth power). If one
word in a sentence is incorrectly tagged, syntactic parsing is likely
to fail. Taggers also do not take into account phrases. Given
"All of a sudden Jim set the French fries on the table,"
the Brill tagger tags each of the words individually.

It makes life easier on the syntactic parser if all of a
sudden is treated as a single adverb and French fries as
a single noun. ThoughtTreasure (Mueller, 1998a, pp. 261-271) contains a
dictionary of inflections and mechanisms for inflectional and
derivational morphology. Technology for exhaustive lexical analysis is
also available from Teragram.

Syntactic parsing

A syntactic parser takes text or tokenized text such as
"Jim set the milk on the table" and converts it into one or more
parse trees.
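
For illustration, one parse (in a schematic bracketing; each parser
produces output in its own formalism) attaches "on the table" to the
verb "set":

(S (NP Jim) (VP set (NP the milk) (PP on (NP the table))))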

Many syntactic parsers have been built. Some that can be downloaded are:

Name                                              Formalism            English lexical entries
Link Grammar Parser (Sleator & Temperley, 1993)   link grammar         25,000
ThoughtTreasure 0.00022 (Mueller, 1998b)          CYK primordial soup  35,000
XTAG (XTAG Research Group, 1999)                  FB-LTAG              105,000

Full syntactic parsing is often slow and subject to combinatorial
explosions resulting from ambiguity. For example, prepositional phrase
attachment ambiguity leads to an alternative parse of the above
sentence, in which "on the table" modifies "the milk" rather than "set".
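
Schematically (again, the bracketing is only illustrative):

(S (NP Jim) (VP set (NP the milk (PP on (NP the table)))))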

Partial parsing is therefore often used: Instead of returning a complete
list of all the possible parses of a sentence, the parser returns a list
of parse fragments (Bangalore, 1997).

Semantic parsing

A semantic parser converts input sentences or syntactic parse trees
into semantic representations such as frames or logical
formulas. These representations provide a surface-level understanding
and serve as a point of departure for a more in-depth understanding.

Given "Jim set the milk on the table," the ThoughtTreasure semantic
parser (Mueller, 1998a, pp. 125-173) produces a frame whose predicate
corresponds to the appropriate sense of set and whose slots are filled
by Jim, the milk, and on the table.

A semantic parser requires information about argument structure. For a
verb this consists of information about what noun phrases,
prepositional phrases, adjective phrases, and clauses can be provided
as arguments, and how those arguments map to thematic roles
(slots) of the output frame. For the placement sense of the verb set,
for example, the lexicon records that the subject maps to an agent
role, the direct object to a theme role, and an on prepositional
phrase to a location role.
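
As a sketch of what such an entry and the frame built from it might
look like in code (the entry format, sense name, and role names are my
own illustration, not ThoughtTreasure's actual notation):

# Toy argument-structure entry for the placement sense of "set".
SET_ENTRY = {
    "sense": "set-down",
    "roles": {"subject": "agent", "object": "theme", "pp-on": "location"},
}

def build_frame(entry, subject, obj, pp_on=None):
    # Map parsed arguments to the thematic roles (slots) of an output frame.
    frame = {"predicate": entry["sense"],
             entry["roles"]["subject"]: subject,
             entry["roles"]["object"]: obj}
    if pp_on is not None:
        frame[entry["roles"]["pp-on"]] = pp_on
    return frame

# "Jim set the milk on the table" ->
# {'predicate': 'set-down', 'agent': 'Jim', 'theme': 'the milk',
#  'location': 'the table'}
print(build_frame(SET_ENTRY, "Jim", "the milk", "the table"))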

Lexicons containing broad coverage of argument structure and thematic
roles useful for semantic parsing are not widely available: Mitton's (1992)
lexical database contains argument structure information for 23,227 verbs
(of a total of 70,646 lexical entries). However, the database does not
distinguish word senses and does not specify argument structure in
detail. It indicates, for example, that a verb takes a prepositional
phrase argument, but not what particular prepositions are used.

WordNet (Fellbaum, 1998) uses 34 argument structure patterns and
associates each verb sense with one or more of these
patterns. However, detailed information such as what preposition is
used is not always provided. Kohl, Jones, Berwick, and Nomura (1998)
have extended WordNet to 226 argument structure patterns and entered
these patterns for 2,600 verbs, but they are not yet part of the
general release. These patterns also provide the mapping from arguments
to thematic roles.

Levin (1993) argued that the syntactic behavior of a verb is
determined to a large degree by its meaning. Levin classified 4,183
verbs into 191 classes with similar meaning and syntactic behavior.
This work is often consulted when building lexicons for semantic
parsers, but not used directly: In coding the argument structure for
4,000 verbs, Burns and Davis (1999) encountered a number of exceptions
and concluded that the classification "does not completely obviate
entering single words by hand" (p. 16). Kohl, Jones, Berwick, and
Nomura (1998) used Levin's example sentences but annotated them in
Prolog from scratch rather than relying on the proposed classes.

A semantic parser also incorporates selectional constraints
that specify what types of arguments are permitted. Selectional
constraints may be on concepts, as in:

bark prefers a dog as a subject
eat prefers food as an object

Selectional constraints may also apply to specific words:

a yellow lemon is called yellow
yellow hair is called blond

Anaphora

A story understanding system must resolve various anaphoric entities
into the objects to which they refer. Examples of anaphoric entities
are pronouns (she, they), possessive determiners
(my, his), and arbitrary constructions involving
relative clauses (the $1,200 they had forced him to give them,
the milk that fell on the floor).

Anaphora resolution is a difficult problem. ThoughtTreasure contains
tools for recognizing anaphoric entities and suggesting possible
referents (Mueller, 1998a, pp. 154-162). ThoughtTreasure partially
implements a method proposed by Huls, Bos, and Claassen (1995) for
determining the likelihood of each possible referent. Resolution of
anaphora is best performed by the UAs.

Integrated parsing

Dyer (1983) argued for integrated parsing in which all parsing
processes work in close cooperation. Integrated parsing is valid from
a cognitive standpoint since humans are able to understand in real
time as words are read or heard; there is no need to wait until the
end of a sentence in order to understand it. Integrated parsing also
reduces combinatorial explosion since incorrect possibilities can be
eliminated sooner rather than later.

However, integration is difficult to achieve since the right
information is not always available at the right time.

Suppose we want to integrate anaphor resolution into semantic
parsing. In order to resolve a pronoun, its antecedent must be
available. But only those antecedents from prior sentences are
available inside semantic parsing. Antecedents within the current
sentence might not yet be available, since those are the very concepts
that semantic parsing is in the process of constructing. (What
concepts are available inside semantic parsing depends on where one is
in the tree and the order of traversal of branches.)

Suppose we want to integrate semantic parsing into syntactic parsing:
A syntactic subtree may not contain all the information needed to
parse it. For example, it is more difficult to perform semantic
parsing of a verb phrase in isolation, since selectional constraints
on the unavailable subject cannot be used to disambiguate the verb
sense. Many subordinate clauses cannot be parsed in isolation either
since they inherit arguments from their superordinate clauses.

A solution to the syntax-semantics integration problem has been
proposed by Mahesh, Eiselt, and Holbrook (1999).

Commonsense knowledge bases

Several databases of commonsense knowledge are being built. Although
the eventual goal is to capture the common sense that humans have, the
emphasis so far has been on defining the atomic concepts of the world
such as car and motor-vehicle and relations on
those concepts such as:

ako: Cars are a kind of motor-vehicle.

isa: Boston is a city.

part-of: Cars have ignition-switches.

material-of: T-shirts are made of cotton.

used-for: Cars are used to drive.

used-at: Frying-pans are used in the
kitchen.

color-of: Grass is green.

size-of: Chairs are 3 feet tall.

duration-of: Plays last 2 hours.

typical-subject-of: Dogs bark.

typical-object-of: Food is eaten.

implies: Cry implies sad.

causes: Cheer-up causes happy.
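
Such relations are often stored simply as triples that UAs can query;
a minimal sketch (the storage scheme is my own, while the relations
follow the list above):

# (relation, concept1, concept2) triples for a few of the relations above.
FACTS = [
    ("ako",      "car",             "motor-vehicle"),
    ("isa",      "Boston",          "city"),
    ("part-of",  "ignition-switch", "car"),
    ("used-for", "car",             "drive"),
    ("color-of", "grass",           "green"),
    ("causes",   "cheer-up",        "happy"),
]

def related(relation, concept):
    # Return all concepts standing in the given relation to the given concept.
    return [c2 for r, c1, c2 in FACTS if r == relation and c1 == concept]

print(related("ako", "car"))  # ['motor-vehicle']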

The following table shows the commonsense knowledge bases that currently
exist and counts of the number of concepts and common relations:

Name                                              Concepts  ako/isa  part-of/material-of  Other
Cyc (Guha & Lenat, 1994)                          30,000    25%      35%                  40%
Cyc Upper Ontology 2.1 (Cycorp, 1998)             2,846     7,161    0                    2,579
Mikrokosmos (Mahesh, 1996)                        4,500     -        -                    -
MindNet (Richardson, Vanderwende, & Dolan, 1993)  45,000    47,000   14,100               32,900
SENSUS (Knight & Luk, 1994)                       70,000    -        -                    -
ThoughtTreasure 0.00022 (Mueller, 1998b)          27,093    28,818   666                  21,821
WordNet 1.6 (Fellbaum, 1998)                      99,642    78,446   19,441               42,700

ThoughtTreasure also contains a database of 100 typical situations or
scripts (Mueller, 2000a).

A commonsense knowledge base is a useful resource for a story
understanding system: The unambiguous concept names can be used for
communication between UAs as well as within a single UA. The
relational information in the commonsense knowledge base can be used
by UAs to recognize and resolve ambiguities.

Most importantly:

The commonsense knowledge base can be
evolved along with the story understanding system: Whenever a piece of
commonsense knowledge would come in handy in the story understanding system,
it can be added to the database. The database can thus be grown to be
useful for the story understanding application.

The above databases have various advantages and disadvantages:
WordNet (Fellbaum, 1998) was designed as a lexical, rather than a
conceptual, database so it lacks links between words in different
syntactic categories. For example there is no link between the noun
creation and the verb create.

The accuracy of the relations in MindNet is only about 78 percent
(Richardson, Vanderwende, & Dolan, 1993) since it was built
automatically by parsing a machine-readable dictionary. The concept
names also cannot be relied upon since they were obtained by automatic
word sense disambiguation (Richardson, Dolan, & Vanderwende, 1998).

Summary

In summary, I propose the following plan for building an in-depth
story understanding system:

Feed input text to text agents for recognizing entities such as names
and places.

Perform lexical analysis on the input.

Use part-of-speech taggers and word sense disambiguators
to reduce possibilities.

Minsky, Marvin. (1986).
The society of mind.
New York: Simon and Schuster.

Mitton, Roger. (1992).
A description of a computer-usable dictionary file based on
the Oxford Advanced Learner's Dictionary of Current English.
London: Department of Computer Science, Birkbeck College,
University of London.