Concepts in the Lexicon

1. Problems and Issues

The lexical entry for a word must contain all the information needed to
construct a semantic representation for sentences that contain the word.
Because of that requirement, the formats for lexical representations must
be as detailed as the semantic forms. Simple representations, such as
features and frames, are adequate for resolving
many syntactic ambiguities. But since those notations cannot represent
all of logic, they are incapable of supporting all the functions needed
for semantics. Richer semantics-based approaches have been developed
in both the model-theoretic tradition and the more computational
tradition of artificial intelligence. Although superficially in
conflict, these traditions have a great deal in common at a deeper level.
Both of them have developed semantic structures that are capable
of representing a wide range of linguistic phenomena.

1.1 Semantics from the Point of View of the Lexicon

To understand a semantic theory, start by looking at what goes into
the lexicon. In one of the early semantic theories in the Chomskyan
tradition, Katz and Fodor (1963) did in fact start with the lexicon.
Other theories, however, almost treat the lexicon as an
afterthought. Yet the essence of any semantic theory is still in the
lexicon: every element of the semantic representation of a sentence
ultimately derives from something in the lexicon. That principle is
just as true for Richard Montague's highly formalized grammar as for
Roger Schank's "scruffy" conceptual dependencies, scripts, and MOPs.

Besides the meanings of words, grammar and logic are necessary
to combine the meanings into a complete semantic representation.
But there are competing theories about how much grammar and logic is
necessary, how much is expressed in the lexicon, and how much is
expressed in the linguistic system outside the lexicon. Lexically based
theories suggest that the grammar rules should be simple and that
most of the syntactic complexity should be encoded in the lexicon.
Some linguists say that most of the syntactic complexity
isn't syntactic at all. It is the result of interactions among
the logical structures of the underlying concepts.
In his work on semantically based syntax, Dixon (1991) showed that
syntactic irregularities and idiosyncrasies
can be predicted from the semantics of the
words. Such theories imply that a language processor would only need
a simple grammar if it had sufficiently rich semantic structures.
The lexicon is the place where those semantic structures are stored.

A complete theory of semantics in the lexicon must also explain how
the semantics gets into the lexicon. A child could learn an initial
stock of meanings by associating prelinguistic structures with words.
But even those prelinguistic structures are shaped, polished, and
refined by long usage in the context of sentences. They are
combined with the structures learned from other words, and they are
molded into patterns that are traditional in the language and culture.
More complex, abstract, and sophisticated concepts are learned either
exclusively through language or through experiences that are highly
colored and shaped by language. For these reasons, the
meaning representations in the lexicon should be compatible
with the semantic representations for sentences.
As a working hypothesis, the two should be identical: the same
kinds of structures should be used to represent meanings in the lexicon
and to represent the semantics of sentences and extended discourse.
Simplified notations may be used for special purposes, but
they must be capable of being translated automatically
to the general semantic representations.

Although the lexicon is an important repository of semantic information,
it doesn't contain all the information needed to understand language.
Context and background knowledge are also important, since most
sentences cannot be understood in isolation.
Alfred North Whitehead (1941) gave the following example:

There is not a sentence which adequately states its own meaning.
There is always a background of presupposition which defies
analysis by reason of its infinitude. Let us take the simplest case;
for example, the sentence "One and one make two."

Obviously this sentence omits a necessary limitation. For one thing
and itself make one thing. So we ought to say, "One thing and another
thing make two things." This must mean that the togetherness of one
thing with another thing issues in a group of two things.

At this stage all sorts of difficulties arise. There must be the proper
sort of things in the proper sort of togetherness. The togetherness
of a spark and gunpowder produces an explosion, which is very unlike
two things. Thus we should say, "The proper sort of togetherness of
one thing and another thing produces the sort of group which we call
two things." Common sense at once tells you what is meant.
But unfortunately there is no adequate analysis of common sense,
because it involves our relation to the infinity of the Universe.

Also there is another difficulty. When anything is placed in another
situation, it changes. Every hostess takes account of this truth
when she invites suitable guests to a party; and every cook presupposes it
as she proceeds to cook the dinner. Of course, the statement,
"One and one make two" assumes that changes in the shift of circumstance
are unimportant. But it is impossible for us to analyze this notion
of "unimportant change." We have to rely upon common sense.

In fact, there is not a sentence, or a word, with a meaning which is
independent of the circumstances under which it is uttered.

Examples such as these contradict Frege's principle of
compositionality, which says that the meaning of a sentence is
derived from the meanings of the words in their syntactic combinations.
Yet context can also be stated in words and sentences.
Even when nonlinguistic circumstances are necessary for understanding
a sentence, the relevant aspects could be stated in a sentence.
For every one of his examples, Whitehead did exactly that.
An extended Fregean principle should therefore say that the meaning
of a sentence must be derivable from the meanings of the words in the
sentence together with the meanings of the words in the sentences
that describe the relevant context and background knowledge.
But as Whitehead cautioned, there is no way to predict in advance
what might be relevant.

1.2 Review of Lexical Representations

Monadic predicates, also known as features, properties,
or attributes, are one of the oldest and simplest knowledge
representations. They are the foundation for Aristotle's syllogisms
and modern frame systems and neural networks.
In his Universal Characteristic, Leibniz (1679) assigned
a prime number to each feature and represented compound concepts
by products of the primes. If Rational were represented by 2 and
Animal by 3, then their product 6 would represent Rational Animal
or Human. Such a representation generates a lattice: concept A is
a subtype of B (A ≤ B) if the number for B divides the number for A;
the minimal common supertype (A ∪ B) corresponds to their greatest
common divisor; and the maximal common subtype (A ∩ B) corresponds
to their least common multiple. Leibniz tried to use his system to
mechanize Aristotle's
syllogisms, but a feature-based representation is too limited.
By themselves, features cannot represent quantifiers and negation or show
how the primitives that make up a compound are related to one another.
Some modern systems use bit strings instead of products of primes,
but their logical power is just as limited as Leibniz's system of 1679.
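
Leibniz's scheme is easy to emulate in a few lines of code. The following
Python sketch (an illustration added here, not part of Leibniz's or any
cited system) assigns primes to features and reduces the lattice
operations to divisibility, greatest common divisor, and least common
multiple:

    from math import gcd

    # Feature-to-prime assignment in the spirit of Leibniz (1679).
    RATIONAL, ANIMAL = 2, 3
    HUMAN = RATIONAL * ANIMAL   # a compound concept is the product of its features

    def is_subtype(a, b):
        # A is a subtype of B when the number for B divides the number for A.
        return a % b == 0

    def common_supertype(a, b):
        # Minimal common supertype: greatest common divisor.
        return gcd(a, b)

    def common_subtype(a, b):
        # Maximal common subtype: least common multiple.
        return a * b // gcd(a, b)

    assert is_subtype(HUMAN, ANIMAL)                  # Human ≤ Animal
    assert common_supertype(HUMAN, RATIONAL) == RATIONAL
    assert common_subtype(RATIONAL, ANIMAL) == HUMAN

As the text notes, nothing in this encoding can express negation,
quantifiers, or the relationships among the features of a compound.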

In their feature-based system, Katz and Fodor (1963) factored
the meaning of a word into a string of features and an undigested lump
called a distinguisher. Following is their
representation for one sense of the word bachelor (the sense that
applies to a fur seal):

    bachelor → noun → (Animal) → (Male) → (Young) →
        [without a mate during the breeding time]

In this definition, noun is the syntactic category; the
markers (Animal), (Male), and (Young) are the semantic features
that contain the theoretically significant information;
and the phrase in brackets is the unanalyzed distinguisher.
Shortly after it appeared, the Katz-Fodor theory was subjected to
devastating criticisms. Although no one today uses the theory in its
original form, those criticisms are worth mentioning because many
of the more recent approaches suffer from the same limitations:

The sharp distinction between semantic features and the distinguisher
is so fundamental to the theory that it should have an enormous impact on
the structures of language and the normal use of language. Yet there is
no linguistic evidence from syntax or cooccurrence patterns to indicate
that it has any effect whatever (Bolinger 1965).

The distinguisher is made up of words, each of which has its own
meaning. A complete semantic theory should explain how the meanings
of the words in the distinguisher contribute to the meaning of the whole.
But such an analysis would imply a deeper representation that underlies
both the features and the distinguisher.

The Katz-Fodor theory treats different senses of the same word as
if they were unrelated to one another.
Yet the four senses of bachelor have a great deal in
common. They all represent an immature or transitional stage that leads
to some further goal: a student who has completed an academic step
on the way to becoming a master or doctor; a young knight who is still
an apprentice to another; a seal on its way to full maturity as
a patriarch of the herd; or an unmarried man who has not yet started
to form his own family. The feature-distinguisher theory does not show
the commonality; it cannot explain how these meanings developed from a
common root or why they remain associated with the same word form.

If the features had no deeper structure, there would be nothing to
constrain their possible combinations. Yet certain combinations,
such as (Abstract)&(Color) or (Action)&(Weight), never occur.
More structure is needed in the theory to explain why such combinations
are impossible.

Finally, many features cannot be named with a single word.
Certain Mexican dialects, for example,
make a distinction between a difunto, a deceased person who
was married at the time of death, and an angelito, a deceased
person who was not married at the time of death (El Guindi 1986).
A feature such as (Married-at-the-time-of-death) is so blatantly
nonprimitive that it cries out for a theory that represents
deeper structures.

All these criticisms reflect the fundamental limitation of features:
monadic predicates cannot express relationships of two or more entities.
After Katz and Fodor factored out the features, their distinguisher
was left with all the combinations that required two or more links.

Despite their limitations, features can be used as slot fillers
in other combinatorial structures: conjunctions of features form
lattices, and weighted sums of features form neural networks.
But the methods for combining features make a significant difference
in the meaning of the results and the way they are used for reasoning.
Neural networks are good for classification, but they are opaque data
structures that are difficult or impossible to interpret by humans.
Leibniz's lattices with all combinations of features are easy
to understand, but they have too many useless or impossible nodes.
To generate lattices without the undesirable combinations,
Ganter and Wille (1999)
developed the theory of
formal concept analysis (FCA),
which can be used to construct lattices from the same input data used
for neural networks: collections of instances described by features.
Like neural networks, FCA lattices are good for classification, but they
are readable data structures that form the backbone of a type hierarchy.
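
To make the idea concrete, the following Python sketch (a brute-force
illustration, not Ganter and Wille's algorithm) derives the formal
concepts of a tiny context; each concept pairs a set of instances (its
extent) with the features they all share (its intent), and the concepts
ordered by inclusion form the lattice:

    from itertools import combinations

    # A toy formal context: instances described by features (hypothetical data).
    context = {
        "sparrow": {"animal", "bird", "flies"},
        "penguin": {"animal", "bird"},
        "bat":     {"animal", "flies"},
    }

    def extent(features):
        # All instances that have every feature in the given set.
        return {obj for obj, feats in context.items() if features <= feats}

    def intent(objects):
        # All features shared by every instance in the given set.
        if not objects:
            return set.union(*context.values())
        return set.intersection(*(context[o] for o in objects))

    def formal_concepts():
        # Enumerate distinct (extent, intent) pairs by closing every subset of instances.
        objs = list(context)
        seen = set()
        for r in range(len(objs) + 1):
            for subset in combinations(objs, r):
                feats = intent(set(subset))
                ext = frozenset(extent(feats))
                if ext not in seen:
                    seen.add(ext)
                    yield sorted(ext), sorted(feats)

    for ext, feats in formal_concepts():
        print(ext, feats)

Only feature combinations that actually describe some set of instances
appear as nodes, which is how FCA avoids the useless combinations of
Leibniz's full lattice.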

Frames and specialized terminological languages are the next step
beyond features. Besides the monadic predicates used for features,
frames use dyadic predicates to connect the definiendum to slots,
which correspond to existentially quantified variables.
To improve efficiency, many such systems restrict the logical power
of the language by eliminating Boolean operators other than conjunction.
Yet such restricted logics cannot express all dictionary definitions.
The definition of penniless, for example, requires
a negation to express "not having a penny." The word bachelor
in Katz and Fodor's example requires temporal logic to express
the distinguisher [without a mate during the breeding time].
Doyle and Patil (1991) gave examples of terms that cannot be defined
without a richer set of operators.

For any application that requires such terms, there are only two
solutions: either leave the terms undefined or introduce dubious
"primitives" like without-a-mate-during-the-breeding-time.
For natural language understanding, semantic representations
must be able to express anything that people might say.
Since every logical quantifier, Boolean operator, and modal operator
occurs in dictionary definitions, a complete definitional language
must have the full power of logic.
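
For instance, the sense of penniless mentioned above already steps
outside a conjunction-only frame language; a rough rendering in ordinary
predicate logic (an illustration, not a formula from any of the cited
systems) would be

    penniless(x)  ≡  ¬∃y (penny(y) ∧ has(x,y)),

and the bachelor distinguisher would further require temporal operators
to restrict the condition to the breeding time.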

In several articles written shortly before his death,
Richard Montague (1974) applied
model theory to natural language
semantics. He started with Carnap's notion (1947) that the meaning
or intension
of a sentence can be represented by a function from possible worlds
to truth values. To derive that function, he used syntax as a guide
to assembling the intension of a sentence from the intensions of the
words it contains. For each noun in the lexicon, the intension is
represented by a function that applies to some entity in the world.
The intension of the noun unicorn, for example, would be
a function that applies to entities in the world and generates the value
true for each unicorn and false for each nonunicorn.
Lexical categories other than nouns are represented by lambda expressions
that combine with the functions that represent neighboring words.
As an example, Montague's lexical entry for the word be is
a function that checks whether the predicate P
is true of the subject x. The idea is straightforward, but
the implementation leads to functions of functions that generate
other functions of functions of functions.
For Montague, the intension of be is a function d for
which the following axiom is true:

(∀x)(∀P)□(d(x,P) ≡
(∀y)(ext(y) = ext(x) ⊃ (∀z)(z ∈ ext(y) ⊃ P(z)))).

This axiom says that for any subject x and predicate
P, it is necessary that d is true of x and
P if and only if for any y whose extension is
equal to the extension of x and for any z in the extension
of y, the predicate P is true of z.
(This formula is actually a simplified restatement of Montague's
more terse and even more cryptic notation.)
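
The functional machinery is easier to see in code than in the notation.
The following Python sketch (a loose illustration of the idea, not
Montague's system) treats the intension of a noun as a function from a
possible world to a characteristic function over entities and evaluates
it in two toy worlds:

    # A toy "possible world" maps each entity name to the set of predicates true of it.
    def noun_intension(predicate):
        # Intension of a common noun: world -> (entity -> truth value).
        def at_world(world):
            return lambda entity: predicate in world.get(entity, set())
        return at_world

    unicorn = noun_intension("unicorn")

    w1 = {"bucephalus": {"horse"}, "amalthea": {"unicorn"}}   # a world with a unicorn
    w2 = {"bucephalus": {"horse"}}                            # a world without one

    print(unicorn(w1)("amalthea"))   # True
    print(unicorn(w2)("amalthea"))   # False

Composing such functions for every word of a sentence yields the
functions of functions of functions mentioned above, which is where the
computational cost of the approach arises.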

Besides defining functions, Montague used them to solve
certain logical puzzles, such as Barbara Partee's example:
The temperature is ninety, and it is rising. Therefore, ninety
is rising. To avoid the conclusion that a constant like ninety
could change, Montague drew some subtle distinctions. He treated
temperature as an "extraordinary noun" that denoted
"an individual concept, not an individual." He also gave
special treatment to rise, which "unlike most verbs,
depends for its applicability on the full behavior of individual
concepts, not just on their extensions." As a result, he
claimed that The temperature is ninety asserted the
equality of extensions, but that The temperature is rising
applied the verb rise to the intension.
Consequently, the conclusion that ninety itself is rising would be
blocked, since rise would not be applied to the extension.

To linguists, Montague's distinction between words
whose semantics depend on intensions and those whose semantics
depend on extensions seemed like an ad hoc contrivance
with no linguistic evidence to support it.
To psychologists, the complex manipulations required for processing
the lambda expressions seemed unlikely to have any psychological reality.
And to programmers, the infinities of possible worlds seemed
computationally intractable. Yet for all its infelicities,
Montague's system was an impressive achievement:
it showed that formal methods of logic could be applied
to natural languages, that they could define the semantics of
an interesting subset of English, and that they could represent
logical aspects of natural language with the depth and precision usually
attained only in artificial systems of logic.

At the opposite extreme from Montague's logical rigor are Roger Schank's
informal diagrams and quasi-psychological theories that were never
tested in controlled psychological experiments. Yet they led his
students to build impressive demos that exhibited interesting language
behavior. As an example, the Integrated Partial Parser (Schank,
Lebowitz, & Birnbaum 1980) represents a fairly mature stage of Schank's
theories. IPP would analyze newspaper stories about international
terrorism, search for words that represent concepts in that domain,
and apply scripts that relate those concepts to one another.
In one example, IPP processed the sentence, About 20
persons occupied the office of Amnesty International seeking
better jail conditions for three alleged terrorists.
To interpret that sentence, it used IPP's dictionary entry for the word
occupied. That entry says that occupied has interest level 5 (on a
scale from 0 to 10), and it is an event builder (EB)
of subclass scene event builder (SEB).
The template is a script of type $Demonstrate with slots for
an unknown actor, object, and demands. As its method, the demonstration
has a scene of type $Occupy with an unknown actor and location.
At the end of the entry are fill and request slots that give procedural
hints for finding the actor, object, location, and demands.
In using this template, IPP assigned phrases from the sample sentence
to the empty slots: "about 20 persons" fills the actor slot;
"the office of Amnesty International" fills the location slot; and
"better jail conditions" fills the demands slot.

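The original entry was a LISP structure; the following Python sketch
(hypothetical names and layout, guided only by the description above)
shows roughly how such a template and its slot filling might look:

    # A simplified stand-in for the IPP entry for "occupied": a $Demonstrate script
    # with empty slots; the slot names follow the description in the text.
    template = {
        "script": "$Demonstrate",
        "interest": 5,
        "slots": {"actor": None, "object": None, "demands": None, "location": None},
    }

    # Phrases that a shallow analysis might pull out of the sample sentence.
    phrases = {
        "actor": "about 20 persons",
        "location": "the office of Amnesty International",
        "demands": "better jail conditions",
    }

    def fill_slots(template, phrases):
        # Copy each available phrase into the matching empty slot; leave the rest unfilled.
        filled = dict(template, slots=dict(template["slots"]))
        for slot, phrase in phrases.items():
            if filled["slots"].get(slot) is None:
                filled["slots"][slot] = phrase
        return filled

    print(fill_slots(template, phrases))
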
The fill and request slots implement the Schankian expectations.
A fill slot is filled with something previously found in the sentence,
and a request slot waits for something still to come.
They serve the same purpose as Montague's rules for applying the function
associated with a verb to the functions for the subject on its left and
the object on its right. Schank's rules for filling slots correspond
to Montague's rules for expanding a lambda expression. The differences
in their terminology obscure the similarities in what they do:

Schank's antiformalist stance is irrelevant, since anything
that can be programmed on a digital computer could be formalized.
One Prolog programmer, in fact, showed that most of the slot filling
in Schank's parsers and script handlers could be done directly by
Prolog's unification algorithm. Techniques such as unification and
graph grammars could be used to formalize Schank's methods while making
major improvements in clarity, robustness, and generality.
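
A toy unifier makes the observation concrete. The sketch below (written
in Python rather than Prolog, and not the program referred to above)
fills script slots by unifying a template pattern, whose variables begin
with "?", against a clause extracted from the sentence:

    def unify(pattern, fact, bindings=None):
        # Minimal term unification: variables are strings starting with "?".
        bindings = dict(bindings or {})
        if isinstance(pattern, str) and pattern.startswith("?"):
            if pattern in bindings:
                return unify(bindings[pattern], fact, bindings)
            bindings[pattern] = fact
            return bindings
        if pattern == fact:
            return bindings
        if isinstance(pattern, tuple) and isinstance(fact, tuple) and len(pattern) == len(fact):
            for p, f in zip(pattern, fact):
                bindings = unify(p, f, bindings)
                if bindings is None:
                    return None
            return bindings
        return None

    # Filling the $Occupy scene by unification instead of fill/request procedures.
    scene  = ("occupy", "?actor", "?location")
    clause = ("occupy", "about 20 persons", "the office of Amnesty International")
    print(unify(scene, clause))
    # {'?actor': 'about 20 persons', '?location': 'the office of Amnesty International'}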

Montague's appearance of rigor results from his use of Greek
letters and logical symbols. Yet some constructions, such as
his proposed solution to Partee's puzzle, are contrivances
that programmers would call "hacks." Montague was a lambda-calculus
hacker, an occupation that requires different training,
but the same kind of talent as a good computer programmer.

Schank and Montague had different attitudes about what aspects
of language were most important. Schank believed that the ability
to represent and use world knowledge is the essence of language
understanding, and Montague believed that the ability to handle
the scope of quantifiers and modalities was the most significant.
Both of them were right in believing that their favorite aspects were
important, but both were wrong in ignoring the others.

Schank and Montague represented different aspects of language with
different methodologies, but they are complementary rather than
conflicting. Wilks (1991) observed that Montague's lexical
entries are most complex for words like the, for which
Schank's entries are trivial. Conversely, Schank's entries are richest
for content words, which Montague treated as primitive functions
while ignoring their connotations. Logic and background knowledge
are important, and the lexicon must support both.

1.3 Metaphysical Baggage and Observable Results

Linguistic theories are usually packaged in metaphysical terms
that go far beyond the available evidence. Chomsky's metaphysics
may be summarized in a single sentence from Syntactic
Structures: "Grammar is best formulated as a self-contained
study independent of semantics." For Montague, the title and
opening sentence of "English as a Formal Language" express
his point of view: "I reject the contention that an important
theoretical difference exists between formal and natural
languages." Schank's outlook is summarized in the following sentence
from Conceptual Information Processing: "Conceptual
Dependency Theory was always intended to be a theory of how humans
process natural language that was explicit enough to allow for
programming it on a computer." These characteristic
sentences provide a key to understanding their authors' motivation.
Yet their achievements are easier to understand when the
metaphysics is ignored. Look at what they do, not at what they say.

In their attitudes and metaphysics, Schank and Montague are
irreconcilable. Montague is the epitome of the kind of logician
that Schank has always denounced as misguided or at best irrelevant.
Montague stated every detail of his theory in a precise formalism,
while Schank made sweeping generalizations and left the detailed
programming to his students. For Montague, the meaning of a sentence
is a function from possible worlds to truth values; for Schank, it is
a diagram that represents human conceptualizations. On the surface,
their only point of agreement is their implacable opposition
to Chomsky and "the developments emanating from the
Massachusetts Institute of Technology" (Montague 1970).
Yet in their reaction against Chomsky, both Montague and Schank
evolved positions that are remarkably similar, although their
terminology hides the resemblance. What Chomsky called a noun,
Schank called a picture producer, and Montague called a function from
entities to truth values. But those terms are irrelevant to
anything that they ever did: Schank never produced a single picture
or even stated a plausible hypothesis about how one
might be produced from his diagrams; Montague never applied
any of his functions to the real world, let alone the
infinity of possible worlds he so freely assumed.

In neutral terms, what Montague and Schank did could be described
in a way that makes the logicist and AI points
of view nearly indistinguishable:

Semantics, not syntax, is the key to understanding language.
The traditional grammatical categories are surface manifestations
of the more fundamental semantic categories.

Associated with each word is a characteristic semantic structure
that determines how it combines with other words in a sentence.

The grammar of a language can be reduced to relatively simple rules
that show what categories of words may occur on the right or the left
of a given word (the Schankian expectations or the cancellation rules
of Montague grammar). The variety of sentence patterns is not the
result of a complex grammar, but of the complex interactions between
a simple grammar and the underlying semantic structures.

The meaning of a sentence is derived by combining the semantic
structures for each of the words it contains. The combining operations
are primarily semantic, although they are guided by word order and
inflections.

The denotation of a sentence in a possible world is computed by
evaluating its meaning representation in terms of a model of that world.
Although Schank never used logical terms like denotation,
his question-answering systems embodied effective procedures for
computing denotations, while Montague's infinities
were computationally intractable.

Terms like picture producer or function from entities
to truth values engender heated arguments, but they have no effect
on the application of the theory to language, to the world,
or to a computer implementation. Without the metaphysical baggage,
both theories incorporate a semantics-based approach that is
widely accepted in AI and computational linguistics.

At the level of data structures and operations, there are significant
differences between Montague and Schank. Montague's representations
were lambda expressions, which have the associated operations of
function application, lambda expansion, and lambda contraction.
His metaphysics gave him a rigorous methodology for assigning each
word to one of his categories of functions (even though he never
actually applied those functions to any world, real or possible).
And his concerns about logic led him to a careful treatment
of quantifiers, modalities, and their scope.
Schank's representations are graphs on paper and LISP
structures of various kinds in his students' programs.
The permissible operations include any manipulations of those
structures that could be performed in LISP.
Schank's lack of a precise formalism gave his students the
freedom and flexibility to invent novel solutions to problems
such as the use of world knowledge in language understanding,
which Montague's followers never attempted to address.
Yet that lack of formalism led to ad hoc accretions in
the programs that made them unmaintainable. Many of Schank's
students found it easier to start from scratch and write a new parser
than to modify one that was written by an earlier generation of students.
Montague and Schank have complementary strengths: rigor
vs. flexibility; logical precision vs. open-ended access to background
knowledge; exhaustive analysis of a tiny fragment of English vs. a
broad-brush sketch of a wide range of language use.

Montague and Schank represent two extremes on the semantics-based
spectrum, which is broad enough to encompass most AI work on language.
Since the extremes are more complementary than conflicting, it is
possible to formulate approaches that combine the strengths of both:
a precise formalism, the expressive power of intensional logic, and
the ability to use background knowledge in language understanding.
To allow greater flexibility, some of Montague's rigid constraints
must be relaxed: his requirement of a strict one-to-one mapping between
syntactic rules and semantic rules; his use of lambda expressions as
the primary meaning representation; and his inability to handle ellipsis,
metaphor, metonymy, anaphora, and anything requiring background
knowledge. With a more appropriate formalism, such limitations
could be overcome within a rigorous theoretical framework.

1.4 Language Games

In the classical view of language, semantic theory requires an ontology
of all the concepts (or predicates) expressed by the words of a language.
Words have associated syntactic information about their parts
of speech and their obligatory and optional adjuncts.
Concepts are organized in structures that represent
knowledge about the world: an ontology of concept types;
Aristotelian definitions of each type by genus and differentiae;
selectional constraints on the permissible combinations of concepts;
and axioms or rules that express the implications of the concepts.
Then the lexicon maps words to concepts, listing multiple concept types
for words that have more than one meaning.
With many variations of notation and terminology, this view
has formed the basis for most systems in computational linguistics:

From the earliest days of machine translation, theorists have sought
a universal system of concepts for the elusive interlingua,
which would serve as an intermediate language for the translation
of any natural language into any other natural language.

Margaret Masterman's original semantic networks (1961) were
designed as an ontology for an interlingua. She constructed a lattice
of concept types defined in terms of 100 primitives, which she
intended as universal.

Terry Winograd's SHRDLU (1972) is a famous example of a fixed
mapping between word and concept types with a built-in mechanism
for defining new types.

Richard Montague (1974) formulated the purest expression of the
classical approach in his system of grammar and logic, which
deliberately set out to treat "English as a formal language."

Roger Schank and his students (1975) were strongly opposed to
logic-based approaches like Montague's, but their theory of conceptual
dependencies was just as classical. Their MARGIE system used only 11
primitive acts as a basis for defining all conceptual relationships.

Natural language query systems map a small vocabulary (usually
less than 5,000 words) to a fixed set of concept types that
represent the entities, attributes, and relationships in a database.

These systems have formed the basis for impressive prototypes.
Yet none of them have been general enough to be extended from small
prototypes to broad-coverage language processors:

Winograd's book on SHRDLU was entitled Understanding
Natural Language, but he has now repudiated that title
(Winograd and Flores 1986).
He denies that SHRDLU or any other system built along classical
lines could truly be said to understand natural language.

Schank now admits that language understanding is much harder than
he had thought. In his work on case-based reasoning,
he and his students have used a much larger range of concept types
without bothering to give explicit definitions in terms of primitives.

The most widely used machine translation systems are not based on
universal interlinguae. Instead, it has proved easier to implement
simpler, but often ad hoc transfer schemes between pairs
of languages. An example is the forty-year-old
SYSTRAN system,
which is still used by AltaVista to translate web sites.

Many computational linguists believe that unrestricted language
understanding is impossible or at least impractical with current means.
Instead, they have restricted themselves to designing processors for
limited domains (Kittredge & Lehrberger 1982).

Harris (1968, 1982) has long maintained that specialized grammars
must be written for the various "sublanguages" used in science.
He believed that recognition of distinct sublanguages of each natural
language is a theoretical necessity, not just a practical expedient.

The limitations of classical systems could be attributed either to
fundamental flaws in the approach or to temporary setbacks
that will eventually be overcome. Some computational linguists,
especially the logicians who follow Montague,
are still pursuing the classical ideal with newer
theories, faster computers, and larger dictionaries.
Others who once believed that language was more tractable
eventually lost faith and became some of the most vocal critics.
Bar-Hillel (1960) was one of the early apostates,
and Winograd is one of the more recent.

The most famous apostate who abandoned the classical approach
was Ludwig Wittgenstein. His early philosophy, as presented
in the Tractatus Logico-Philosophicus, was an extreme
statement of the classical view.
It started with the sentence "The world is everything
that is the case" -- a collection of atomic facts
about relationships between elementary objects.
Atomic facts could be combined to form a compound proposition,
which was "a function of the expressions contained in it."
Language for him was "the totality of all propositions."
He regarded any statement that could not be
built up in this way as meaningless, a view that culminated in
the final sentence of the Tractatus: "Whereof one cannot
speak, thereof one must be silent." Wittgenstein's early philosophy
was an inspiration for Tarski's model-theoretic semantics,
which Tarski's student Montague applied to natural language.

In his later philosophy, as presented in the Philosophical
Investigations, Wittgenstein repudiated the "grave mistakes
in what I wrote in that first book." He completely rejected the
notion that all of language could be built up in a systematic
way from elementary propositions. Instead, he presented the view
of language as a "game" where the meaning of a word
is determined by its use.
If there were only one set of rules for the game, a modified version
of the classical approach could still be adapted to it. But
Wittgenstein emphasized that language is not a single unified
game, but a collection of as many different games as one can
imagine possible uses. "There are countless
kinds: countless different kinds of use of what
we call 'symbols,' 'words,' 'sentences.' And this
multiplicity is not something fixed, given once and for all;
but new types of language, new language games, as we may say,
come into existence, and others become obsolete and get
forgotten." As examples of the multiplicity of language
games, he cited "Giving orders, and obeying them; describing the
appearance of an object, or giving its measurements; constructing
an object from a description (a drawing); reporting an event;
speculating about an event; forming and testing a hypothesis;
presenting the results of an experiment in tables and diagrams;
making up a story, and reading it; play acting; singing catches;
guessing riddles; making a joke, telling it; solving a problem
in practical arithmetic; translating from one language into another;
asking, thanking, cursing, greeting, praying." He regarded this view
as a complete rejection of "what logicians have said about the structure
of language," among whom he included Frege, Russell, and himself.

Wittgenstein's language games were the inspiration for speech act theory,
which has become one of the major topics in pragmatics.
Their implications for semantics, however, are just as important.
As an example, consider the verb support in the following
sentences:

Tom supported the tomato plant with a stick.
Tom supported his daughter with $10,000 per year.
Tom supported his father with a decisive argument.
Tom supported his partner with a bid of 3 spades.

These sentences all use the verb support in the same syntactic
pattern:

A person supported NP1 with NP2.

Yet each use of the verb can only be understood with respect to
a particular subject matter or domain of discourse: physical structures,
financial arrangements, intellectual debate, or the game of bridge.
Each domain has its own language game, but they all
share a common vocabulary and syntax.
The meanings of the words, however, change drastically from one domain
to the next. As a result, the mapping from language to reality is
indirect: instead of the fixed mappings of Montague grammar,
the mapping from words to reality may vary with every language game.

Both Wittgenstein's philosophical analyses and thirty years of
experience in computational linguistics suggest the same
conclusion: a closed semantic basis along classical lines
is not possible for any natural language.
Instead of assigning a single meaning or even a fixed set of meanings
to each word, a theory of semantics must permit an open-ended number
of meanings for each word. Following is a sketch of such a theory:

Words are like playing pieces that may be used and reused in
different language games.

Associated with each word is a limited number of lexical patterns
that determine the rules that are common to all the language games
that use the word.

Meanings are deeper conceptual patterns that change from one
language game to another.

Metaphor and conceptual refinement are techniques for transferring
the lexical patterns of a word to a new language game and thereby
creating new conceptual patterns for that game.

As an analogy, Wittgenstein compared the words of a language
to the pawns and pieces in a game of chess. An even better analogy
would be the Japanese games of go and go-moku.
Both games use the same board, the same pieces, and the same syntactic
rules for making legal moves: the board is lined with a 19 by 19 grid;
the pieces consist of black stones and white stones; and starting
with an empty board, two players take turns in placing stones
on the intersections of the grid. Figure 1.1 shows a position
from the game of go on the left and a position from go-moku on the right.

Figure 1.1: Positions from the games of go and go-moku

At a purely syntactic level, the two games appear to be the same.
At a semantic level, however, there are profound differences
in the meanings of the patterns of stones: in go, the goal is to
form "armies" of stones that surround territory; in go-moku, the
goal is to form lines with five consecutive stones of the same color.
As a result, a typical position in go tends to have stones scattered
around the edges of the board, where they can stake out territory.
A typical go-moku position, however, tends to have stones that are
tightly clustered in the center, where they can form connected lines
or block the opponent's lines. Although the same moves are syntactically
permissible in the two games, the semantic differences cause very
different patterns to emerge during play.

In the analogy with language,
the stones correspond to words, and the two games correspond to
different domains of discourse that happen to use the same words.
At a syntactic level, two different games may permit words or pieces
to be used in similar ways; but differences in the interpretation
lead to different meanings for the combinations.
To continue the analogy, new games may be invented that use the same
pieces and moves. In another game, the player with the black stones
might try to form a continuous path that connects the left and right
sides of the board, while the player with white would try to connect the
top and bottom. The syntax would be the same as in go and go-moku,
but the meanings of the patterns of stones would be different.
Just as old pieces and moves can be used in new games,
language allows old words and syntax to be adapted
to new subjects and ways of thinking.

Wittgenstein's theory of language games has major implications for
both computational linguistics and semantic theory. It suggests that
the ambiguities of natural language are not the result of careless speech
by uneducated people. Instead, they result from the fundamental nature
of language and the way it relates to the world: language consists of
a finite number of words that may be used and reused in an unlimited
number of language games. The same words may be used in different games
to express different kinds of things, events, and situations.
To accommodate Wittgenstein's games, this paper draws a distinction
between lexical structures and deeper conceptual structures. It suggests
that words are associated with a fixed set of lexical patterns that
remain the same in various language games. The meanings of those
words, however, are deeper conceptual patterns that may vary drastically
from one game to another. By means of metaphor and conceptual refinement,
the lexical patterns can be modified and adapted to different language
games in order to construct a potentially unlimited number
of conceptual patterns.

1.5 Interactions of the Lexical and Conceptual Systems

Every natural language has a well-organized lexical and syntactic system.
Every domain of knowledge has a well-organized conceptual system.
Complexities arise because each language tends to use and reuse the same
words and lexical patterns in many different conceptual domains.
In his discussion of sublanguages, Harris (1968) cited the following
two sentences from the domain of biochemistry:

The polypeptides were washed in hydrochloric acid.
* Hydrochloric acid was washed in polypeptides.

Harris observed that both of them could be considered
grammatical English sentences.
But he claimed that the grammar of the sublanguage of
biochemistry permitted the first one and excluded the second.
Harris's observations about permissible sentences in biochemistry
are correct, but he attributed too much to grammar.
What makes the second sentence unacceptable are facts about chemistry,
not about grammar. As in the games of go and go-moku, the syntax permits
either combination, but the semantics determines which patterns
are likely or unlikely.

In Harris's examples, the syntax clearly determines the subject
and object. Noun-noun modifiers, however, provide no syntactic clues,
and domain knowledge is essential for understanding them.
The following two noun phrases, for example, both use wash
as a noun that means a liquid used to wash something:

a hydrochloric acid wash
a polypeptide wash

The surface syntax of the noun phrases provides no clues to the underlying
conceptual relations or thematic roles.
Only knowledge of the domain
leads to the expectation that hydrochloric acid would be a component
of the liquid and polypeptides would be washed by the liquid.
A Russian or Chinese chemist with only a rudimentary knowledge of
English could interpret these phrases correctly, but an
English-speaking linguist with no knowledge of chemistry could not.
Although a chemist and a linguist may
share common lexical and syntactic habits, the
conceptual patterns for their specialties are unrelated.
American, Russian, and Chinese chemists, however, would have no
shared lexical and syntactic patterns, but their conceptual patterns
in the field of chemistry would be similar.

Besides determining the correct syntactic patterns, a machine
translation system must also select the appropriate word senses.
For technical terms like hydrochloric acid or
polypeptides, which are used only in a narrow domain,
an MT system with a vocabulary tailored to the domain can usually
select the correct word sense.
More difficult problems occur with common words
that are used in many different domains in slightly different ways.
One Russian-to-English MT system, for example, produced the translation
nuclear waterfall for what English-speaking physicists
call a nuclear cascade. A technical word like nuclear has
a unique translation, but a more common word like waterfall has
more uses in more domains and consequently more possible translations.

The main reason why the word sense is hard to determine is that
different senses may occur in the same syntactic and lexical patterns.
The examples with the verb support all used exactly the same
pattern. Yet Tom performed totally different actions: using a stick
to prop up the tomato plant; giving money to his daughter; and
saying something that made his father's statements seem more convincing.
Physical support is the basic sense of the word,
and the other senses are derived by metaphorical extensions.
In other languages, the basic vocabulary may have been extended by
different metaphors.
Consequently, different senses that all use the same pattern in English
might be expressed with different patterns in another language.
Russian, for example, would use the following constructions:

Tom placed a stick in the ground in order to support [podd'erzhat']
the tomato plant.
Tom spent $10,000 per year on the support [sod'erzhanie] of his daughter.
Tom supported [podd'erzhal] his father with [instrumental case]
a decisive argument.

Russian uses the verb podd'erzhat' in different syntactic
constructions for the first and third sentences.
For the second, it uses a noun sod'erzhanie
derived from a related verb sod'erzhat'
(Nirenburg 1991).
As these sentences illustrate, different uses of a word
may be expressed with the same lexical and syntactic patterns
in one language, but the translations to another language may use
different words in different patterns.

The translation from English to Russian also illustrates another
point: human translators often add background knowledge that
is implicit in the domain, but not stated in the original words.
For this example, the Russian lexical patterns
required an extra verb in two of the sentences.
Therefore, the translator added the phrase placed
a stick in the ground in the first sentence
and the verb spent in the second.
The verbs place and spend and the noun
ground did not occur in the original, but the translator
(Sergei Nirenburg) felt
that they were needed to make natural-sounding Russian sentences.
A syntax-based MT system could not add such information,
which can only come from background knowledge about the domain.
(The term commonsense is often used for background knowledge,
but that term can be misleading for detailed knowledge in
technical domains -- most people do not have any commonsense
intuitions about polypeptides.)

As another example, Cruse (1986) cited the word topless, as
used in the phrases topless dress, topless dancer, and
topless bar. Literally, something is topless if it has
no top. That definition is sufficient for understanding the
phrase topless dress. For the other phrases,
a young child or a computer system without domain-dependent knowledge
might assume that a topless dancer and a topless bar are somehow missing
their own tops. An adult with knowledge of contemporary culture,
however, would know that the missing top is part of the clothing
of the dancer or of certain people in the bar.
Cruse gave further examples, such as topless by-laws
and topless watchdog committee, which require knowledge of
even more remote relationships, including public attitudes
towards topless behavior.
These examples show that domain-dependent knowledge is often essential
for determining the relationship between an adjective and the noun it
modifies. Computer systems and semantic theories that map adjectives
into simple predicates may represent the literal use in topless
dress, but they cannot interpret any of the other phrases.

For the different uses of support and topless,
the lexical and syntactic patterns are the same,
but the conceptual patterns are different.
These examples illustrate a fundamental principle: the same
lexical patterns are used across many different conceptual domains.
The lexical structures are

Relatively domain independent,

Dependent on syntax and word forms,

Highly language dependent.

And the conceptual structures are

Highly domain dependent,

Independent of syntax and word forms,

Language independent, but possibly culture dependent.

When there are cross-linguistic similarities in lexical patterns,
they usually result from underlying conceptual similarities. The
English verb give, for example, takes a subject, object, and
indirect object. Other languages may have different cases marked
by different prepositions, postpositions, inflections, and word order;
but the verbs that mean roughly the same as give
also have three participants -- a giver, a thing given,
and a recipient. In all languages, the three participants in the
conceptual pattern lead to three arguments in the lexical patterns.

The view that lexical patterns are reflections or projections of
underlying conceptual patterns is a widely held assumption in
cognitive science: the first lexical patterns a child learns are
derived from conceptual patterns for concrete things and events.
Actions with an active agent doing something to a passive entity
lead to the basic patterns for transitive verbs.
Concepts like Say or Know that take embedded propositions
lead to patterns for verbs with sentence complements.
Once a lexical pattern is established for a concrete domain,
it can be transferred by metaphor to create similar patterns in
more abstract domains.
By this process, an initial set of lexical patterns can be built up;
later, they can be generalized and extended to form
new conceptual patterns for more abstract subjects.
The possibility of transferring patterns from one domain to another
increases flexibility, but it leads to an inevitable increase in
ambiguity. If the world were simpler, less varied,
and less changeable, natural languages might be unambiguous.
But the complexity of the world causes the meanings of words to shift
subtly from one domain to the next. If a word is used in widely
different domains, its multiple meanings may have little or nothing
in common.

1.6 Information Extraction by Filling Templates

Syntactic theories relate sentence structure to the details of morphemes,
inflections, and word order. Semantic theories relate sentences to
the details of formal logic and model theory. But many of the most
successful programs for information extraction (IE) are based on
domain-dependent templates that ignore the details at the center of
attention of the major theories of syntax and semantics.
During the 1990s, the ARPA-sponsored Tipster project and a series
of message understanding conferences (MUC) stimulated the development
of those techniques. The results showed that the integrated systems
designed for detailed syntactic and semantic analysis are too slow
for information extraction. They cannot process the large volumes
of text on the Internet fast enough to find and extract the information
that is relevant to a particular topic. Instead, competing groups
with a wide range of theoretical orientations converged
on a common approach: domain-dependent templates for representing
the critical patterns of concepts and a limited amount of syntactic
processing to find appropriate phrases that fill slots in the templates
(Hirschman & Vilain 1995).

The group at SRI International (Appelt et al. 1993; Hobbs et al. 1997)
found that TACITUS, a logic-based text-understanding system, was
far too slow. It spent most of its time on syntactic nuances that were
irrelevant to the ultimate goal. They replaced it with FASTUS,
a finite-state processor that is triggered by key words,
finds phrase patterns without attempting to link them
into a formal parse tree, and matches the phrases to the slots
in the templates. Cowie and Lehnert (1996) observed that
the FASTUS templates, which are simplified versions of a logic-based
approach, are hardly distinguishable from the sketchy scripts
that DeJong (1979, 1982) developed as a simplified version
of a Schankian approach. The IPP entry
discussed earlier is a typical example of such Schankian templates.
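
In the same spirit, the following Python fragment (a toy sketch, not
FASTUS) shows the flavor of keyword-triggered, finite-state slot
filling; the trigger words, patterns, and slot names are invented for
illustration:

    import re

    # Hypothetical slot patterns for a toy "attack" template.
    PATTERNS = {
        "perpetrator": re.compile(r"^(.*?)\s+(?:bombed|attacked|occupied)\b"),
        "target":      re.compile(r"\b(?:bombed|attacked|occupied)\s+(.*?)(?: in | on |,|\.|$)"),
        "location":    re.compile(r"\bin\s+([A-Z][\w ]+)"),
    }

    def extract(sentence):
        # Fill whichever slots the patterns can match; ignore everything else.
        template = {slot: None for slot in PATTERNS}
        for slot, pattern in PATTERNS.items():
            match = pattern.search(sentence)
            if match:
                template[slot] = match.group(1).strip()
        return template

    print(extract("Guerrillas bombed the power station in San Salvador."))
    # {'perpetrator': 'Guerrillas', 'target': 'the power station', 'location': 'San Salvador'}

No parse tree is built; phrases are matched directly to the slots, which
is what makes the approach fast and also what limits it to the
predefined templates.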

Many people have observed that the pressures of extracting information
at high speed from large volumes of text have led to a new paradigm
that is common to both the logic-based systems and the Schankian systems.
Appelt et al. (1993) summarized the IE paradigm in three bullet points:

"Only a fraction of the text is relevant; in the case of the
MUC-4 terrorist reports, probably only about 10% is relevant."

"Information is mapped into a predefined, relatively simple,
rigid target representation; this condition holds whenever entry
of information into a database is the task."

"The subtle nuances of meaning and the writer's goals
in writing the text are of no interest."

They contrast the IE paradigm with the more traditional task
of text understanding:

"The aim is to make sense of the entire text."

"The target representation must accommodate the full complexities
of language."

"One wants to recognize the nuances of meaning and the writer's
goals."

At a high level of abstraction, this characterization by the logicians
at SRI International would apply equally well to all the successful
competitors in the MUC and Tipster evaluations. Despite the differences
in their approaches to full-text understanding, they converged on a
common approach to the IE task. As a result, some observers have come
to the conclusion that IE is emerging as a new subfield in computational
linguistics.

The convergence of different approaches on a common paradigm
is not an accident. At both ends of the research spectrum,
the logicians and the Schankians believe
that the IE paradigm is a special case of their own approach,
despite their sharp disagreements about the best way to approach
the task of full-text understanding. To a certain extent, both sides
are right, because both the logic-based approach and the Schankian
approach are based on common underlying principles.
The logical operations of generalization, specialization,
and equivalence can be used to characterize all three approaches
to language processing that were discussed in Section 1.3:

Chomskyan. The starting symbol S for a context-free
grammar is a generalization: every sentence that is derivable
from S by a context-free grammar is a specialization of S,
and the parse tree for a sentence is a record of the sequence
of specialization rules used to derive it from S. Chomsky's original
goal for transformational grammar was to define the equivalence rules
that preserve meaning while changing the shape or appearance
of a sentence. The evolution of Chomsky's theories through the stages
of government and binding (GB) theory to his more recent minimalism has
been a search for the fundamental equivalence rules of Universal Grammar.

Montagovian. Instead of focusing on syntax, Montague
treated natural language as a disguised version of predicate calculus.
His categorial grammar rules for deriving a sentence are specialization
rules, each associated with a lambda expression that builds the
sentence's semantic representation. Hobbs et al. (1993)
explicitly characterized the semantic interpretation of a sentence
as abduction: the search for a specialized formula in logic that
implies the more generalized subformulas from which it was derived.

Schankian. Although Roger Schank has denounced logic
and logicians as irrelevant, every one of his knowledge representations
can be defined as a particular subset of logic with an idiosyncratic
notation. Most of them, in fact, represent the existential-conjunctive
(EC) subset of logic, whose only operators are the existential
quantifier and conjunction. Those two operators, which happen to be
the most frequently used operators in formulas derived from natural
language text, are also the two principal operators in discourse
representation theory, conceptual graphs, and Peirce's existential
graphs. The major difference is that Schank has either ignored
the other operators or treated them in an ad hoc way,
while the logicians have generalized their representations
to accommodate all the operators in a systematic framework.
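
For example, the core content that an IE template would extract from the
IPP sentence discussed in Section 1.2 can be written in EC logic with
nothing but existential quantifiers and conjunctions (a rough
illustration, not a formula from any of the cited systems):

    ∃x ∃y ∃z (Group(x) ∧ Office(y) ∧ Demand(z) ∧ occupy(x,y) ∧ seek(x,z))

Negation, disjunction, and the other operators become necessary only for
the nuances that the IE paradigm deliberately ignores.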

In summary, the operations of logic reveal a level of processing
that underlies all these approaches. The IE templates represent the
special case of EC logic that is common to all of them. The detailed
parsing used in text understanding and the sketchy parsing used in IE
are both applications of specialization rules; the major difference is
that IE focuses only on that part of the available information that is
necessary to answer the immediate goal. The subset of information
represented in the IE templates can be derived by lambda abstractions
from the full information.
This view does not solve all the problems of the competing paradigms,
but it shows how they are related and how innovations in one approach
can be translated to equivalent techniques in the others.

Although IE systems have achieved acceptable levels of recall
and precision on their assigned tasks, there is more work to be done.
The templates are hand tailored for each domain, and their success rates
on homogeneous corpora evaporate when they are applied to a wide range
of documents. The high performance of template-based IE comes
at the expense of a laborious task of designing specialized templates.
Furthermore, that task can only be done by highly trained specialists,
usually the same researchers who implemented the system that uses
the templates.

Parts II and III of this article show how the IE templates fit into
a larger framework that links them to the more detailed issues
of parse trees, discourse structures, and formal semantics.
This framework is related to logic, but not in the same way
as the logic-based systems of the 1980s. Instead, it depends
on a small set of lower-level operations, called
the canonical formation rules, which were originally
developed in terms of conceptual graphs
(Sowa 1984). But those operations can be generalized to any knowledge
representation language, including predicate calculus, frames, and
IE templates. Part II presents the canonical formation rules,
and relates them to conceptual graphs (CGs), predicate calculus,
frames, and templates. The result is not a magic solution
to all the problems, but a framework in which they can be addressed.