Computers with Common Sense

Summary

I am intrigued with the idea of giving computers a modicum of
common sense, or in other words a practical knowledge of everyday
things. This would have huge benefits, for instance, much smarter
ways of searching for information, and more flexible user interfaces
to applications. While it might sound easy, this is in fact very
difficult and has defeated traditional approaches based upon
mathematical logic and AI (artificial intelligence). More recently,
work on speech recognition and natural language processing using
statistical methods has shown great promise. Statistical approaches
offer a way out of the combinatorial explosion faced by AI, and I
am excited by Dan Sperber's
work on relevance theory and the potential for applying statistical
learning techniques to semantics. Unfortunately, there is a lot to
do before it will be possible to realise this in practice.

My long term aim is to understand this better and to work with
others to put it into practice in the form of a multi-user
conversational agent accessible over the Web, so that we can harness
the power of the Web to allow volunteers to teach the system common
sense knowledge by conversing with it in written English (and
eventually other languages). This would be under an open source
license, and free for all to share. For some existing work on common
sense, see Henry Lieberman's MIT Media Lab course:
Common Sense Reasoning for Interactive Applications, with links
to the Open Mind Initiative,
and Doug Lenat's work on Cyc
amongst others. See also the Common
Sense Computing Initiative at the MIT Media Lab.

Update - March 2010

After a gap of several years, I have restarted work on this
project, beginning with extensive reading of research papers in
statistical natural language processing, machine learning and data
mining, and related work in cognitive science. I am particularly
interested in the potential for combining natural language,
cognitive science and the Semantic Web.

Cognitive
science focuses on the study of how information is represented
and transformed in the brain with a strong emphasis on experimental
results. Cognitive architectures such as ACT-R and CHREST provide
valuable insights into human cognition, and point the way to new
kinds of information systems. There are indications that these
architectures could be strengthened by incorporating ideas from
work on machine learning and data mining.

Following the literature search, I have started coding as a way
to test my understanding of the various techniques. The British National Corpus
enabled me to explore statistical models for part-of-speech
tagging, and I am now working on a bottom-up chart parser for
broad coverage of written English. The parser won't attempt to
deal with all of the ambiguities, e.g. those caused by
prepositional attachment, and instead relies on a semantic
processor to find the most natural interpretations. I plan to
explore the use of WordNet for determining
conceptual matches, as well as models of human cognition from
cognitive science.
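
As a rough illustration of the kind of statistical model involved,
here is a minimal sketch of a bigram HMM part-of-speech tagger in
Python. The tiny hand-tagged corpus, the tag set and the smoothing
are hypothetical stand-ins for training against a resource such as
the British National Corpus.

    # A minimal sketch of a bigram HMM part-of-speech tagger. The tiny
    # hand-tagged corpus below is a hypothetical stand-in for a resource
    # such as the British National Corpus.
    from collections import defaultdict
    import math

    corpus = [
        [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("the", "DET"), ("cat", "NOUN")],
        [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
        [("the", "DET"), ("cat", "NOUN"), ("chased", "VERB"), ("a", "DET"), ("dog", "NOUN")],
    ]

    # Count tag-to-tag transitions and tag-to-word emissions.
    transitions = defaultdict(lambda: defaultdict(int))
    emissions = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        prev = "<s>"
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word.lower()] += 1
            prev = tag

    def log_prob(counts, key, smoothing=1.0):
        # Laplace-smoothed log probability estimated from the counts.
        total = sum(counts.values()) + smoothing * (len(counts) + 1)
        return math.log((counts.get(key, 0) + smoothing) / total)

    def viterbi(words):
        """Return the most likely tag sequence for the given words."""
        tags = list(emissions.keys())
        # best[i][t] = (log probability, previous tag) for tagging words[:i+1] ending in t
        best = [{t: (log_prob(transitions["<s>"], t) +
                     log_prob(emissions[t], words[0].lower()), None) for t in tags}]
        for i in range(1, len(words)):
            column = {}
            for t in tags:
                emit = log_prob(emissions[t], words[i].lower())
                column[t] = max(((best[i - 1][p][0] + log_prob(transitions[p], t) + emit, p)
                                 for p in tags), key=lambda x: x[0])
            best.append(column)
        # Trace back the highest scoring path.
        tag = max(best[-1], key=lambda t: best[-1][t][0])
        path = [tag]
        for i in range(len(words) - 1, 0, -1):
            tag = best[i][tag][1]
            path.append(tag)
        return list(reversed(path))

    print(viterbi(["the", "dog", "sleeps"]))   # e.g. ['DET', 'NOUN', 'VERB']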

The natural language processor will translate English into a
semantic representation expressed as labelled arcs, each with a
subject, predicate and object. Much of the current work on the Semantic Web
is heavily influenced by formal logic. This is generally a poor
match for natural language semantics, but I am heartened by the
collection of papers in Natural
Language Processing and Knowledge Representation, published
in 2000 by the AAAI Press together with MIT Press. I plan to
explore techniques for inference and learning using triple-based
representations, together with models of human cognition as a
way to deal with scaling issues.
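
To make the triple-based representation concrete, here is a minimal
sketch in Python of a store of subject-predicate-object facts with a
simple pattern matcher and one hand-written inference rule. The
facts, predicates and the rule are invented for illustration rather
than taken from any existing toolkit.

    # A minimal sketch of a triple store holding subject-predicate-object
    # statements, with a simple pattern matcher and one hand-written
    # inference rule. All names below are hypothetical illustrations.

    facts = {
        ("john", "owes", "money"),
        ("john", "friend-of", "mary"),
        ("bank", "provides", "cash"),
    }

    def match(pattern, triple, bindings):
        """Unify a pattern, where strings starting with '?' are variables,
        against a concrete triple; return the extended bindings or None."""
        bindings = dict(bindings)
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if p in bindings and bindings[p] != t:
                    return None
                bindings[p] = t
            elif p != t:
                return None
        return bindings

    def query(pattern, bindings=None):
        """Yield all variable bindings for which the pattern matches a stored triple."""
        for triple in facts:
            b = match(pattern, triple, bindings or {})
            if b is not None:
                yield b

    def infer_symmetric(predicate):
        """Toy rule: treat the given predicate as symmetric and add the missing triples."""
        new = {(b["?y"], predicate, b["?x"]) for b in query(("?x", predicate, "?y"))} - facts
        facts.update(new)
        return new

    print(list(query(("john", "?p", "?o"))))      # everything known about john
    print(infer_symmetric("friend-of"))           # adds ('mary', 'friend-of', 'john')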

Architecture

The following is outdated, but still serves to give a
general feeling for where I am headed.

The system will be designed to support multiple simultaneous
conversations, either one on one, or as part of chat rooms where
the system is one participant amongst many people. The use of text
rather than speech avoids the costs and problems inherent in speech
processing, although in principle a speech interface could be added.

For initial work on proving the ideas, popular AI scripting
languages seem like a reasonable choice, e.g. Python, Scheme or
Prolog. Later, as experience is gained, it will become easier to
understand what architecture is going to be needed for a scalable
solution. One issue is the relationship between short- and long-term
memory, and the indexing mechanisms needed to support the
very large amount of information needed for an adequate treatment
of common sense reasoning. A further issue is safeguarding
information held about individuals contributing to the system.
Information learned from one person may need to be kept private
and not shared with other people.

Natural language input

The first step is morphological analysis and part-of-speech
identification. At its simplest, this is just a matter of looking
up words in a dictionary. In practice, the preceding words and the
conversational focus will be used to determine the most likely
interpretations of each word, based upon prior training against
a tagged corpus of written texts. This covers the recognition of
compound nouns and named entities. Further study is needed on
detailed requirements for representing word senses efficiently.

One idea is for there to be a unique name for each word sense.
Some grammars annotate lexical entries with attributes that indicate
gender, cardinality and many other properties. Words are often
used in ways that are highly dependent on the context. This
suggests that a collection of concepts may be needed to capture
the fluid nature of word meanings, and that relationships between
words should be expressed as relationships between such collections.
Further work is needed to understand this better.
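
One possible shape for such a lexicon is sketched below: each word
maps to a set of named senses, and each sense carries grammatical
attributes plus a collection of related concepts that can be matched
against the conversational context. The entries, attribute names and
ranking function are hypothetical.

    # A minimal sketch of one possible lexicon representation: each word maps
    # to a set of named senses, and each sense carries grammatical attributes
    # plus a collection of related concepts. All entries are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class Sense:
        name: str            # unique name for the word sense, e.g. "bank#institution"
        pos: str             # part of speech
        attributes: dict = field(default_factory=dict)   # gender, cardinality, ...
        concepts: set = field(default_factory=set)       # related concepts giving context

    lexicon = {
        "bank": [
            Sense("bank#institution", "noun", {"countable": True}, {"money", "finance"}),
            Sense("bank#river", "noun", {"countable": True}, {"river", "land"}),
        ],
    }

    def rank_senses(word, context):
        """Order a word's senses by overlap between their concepts and the context."""
        return sorted(lexicon.get(word, []),
                      key=lambda s: len(s.concepts & context), reverse=True)

    print([s.name for s in rank_senses("bank", {"money", "loan"})])
    # ['bank#institution', 'bank#river'] - the financial sense fits this context best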

The next step is parsing. Natural language is typically highly
ambiguous, and a long sentence may have many hundreds of alternative
parses. To cope with this, the system will be trained to rank grammar
rules according to the context. This enables the use of the "A*"
algorithm to find the most likely parses. The initial training will
be done against a tagged corpus (generally known as a "tree-bank").
There is a wide range of grammatical formalisms, and the choice
will be influenced by the design of the lexicon as well as the
availability of training materials.
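
The following sketch illustrates the general idea of best-first
parsing with ranked rules: edges are explored in order of decreasing
probability using a priority queue, so the first sentence-spanning
parse to be found is the most likely one under the toy grammar. The
grammar, rule probabilities and sentence are invented, and a real
broad-coverage parser would be considerably more elaborate.

    # A minimal sketch of best-first bottom-up parsing with ranked rules.
    # Edges are explored in order of decreasing probability via a priority
    # queue, so the first sentence-spanning S to be popped is the most
    # likely parse under this toy grammar. Grammar, probabilities and the
    # sentence are all hypothetical.
    import heapq
    import math
    from collections import defaultdict

    rules = {                       # (left child, right child) -> [(parent, probability)]
        ("NP", "VP"): [("S", 1.0)],
        ("DET", "NOUN"): [("NP", 0.8)],
        ("VERB", "NP"): [("VP", 0.9)],
    }
    lexical = {                     # word -> [(tag, probability)]
        "the": [("DET", 1.0)], "dog": [("NOUN", 0.9)],
        "chased": [("VERB", 0.8)], "cat": [("NOUN", 0.9)],
    }

    def parse(words):
        agenda = []                   # (-log probability, category, start, end)
        by_start = defaultdict(list)  # start -> [(category, end, log probability)]
        by_end = defaultdict(list)    # end -> [(category, start, log probability)]
        done = set()

        def push(cat, start, end, logprob):
            heapq.heappush(agenda, (-logprob, cat, start, end))

        for i, word in enumerate(words):
            for tag, p in lexical.get(word, []):
                push(tag, i, i + 1, math.log(p))

        while agenda:
            neg, cat, start, end = heapq.heappop(agenda)
            if (cat, start, end) in done:
                continue                        # keep only the best derivation of an edge
            done.add((cat, start, end))
            if cat == "S" and start == 0 and end == len(words):
                return cat, math.exp(-neg)      # best parse found first
            by_start[start].append((cat, end, -neg))
            by_end[end].append((cat, start, -neg))
            # Combine with adjacent edges already in the chart.
            for cat2, end2, lp2 in by_start[end]:
                for parent, p in rules.get((cat, cat2), []):
                    push(parent, start, end2, -neg + lp2 + math.log(p))
            for cat2, start2, lp2 in by_end[start]:
                for parent, p in rules.get((cat2, cat), []):
                    push(parent, start2, end, lp2 + -neg + math.log(p))
        return None

    print(parse(["the", "dog", "chased", "the", "cat"]))   # ('S', ~0.37)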

Resolution of deixis, anaphora and prepositional attachment will
be addressed at a higher level, using the semantic context provided
by the current conversation. This is a departure from traditional
statistical natural language parsing techniques, but is essential
for an adequate treatment of common sense and relevance theory.
The most likely meaning of a word depends on the semantic context,
and not just the preceding few words. Parsing and semantic
processing are thus intertwined, each one feeding off the other.

The purpose of parsing is to enable the system to draw appropriate
inferences. This involves the construction of statements that
represent meaning within the current context. In the simplest case,
the meaning of the utterance follows directly from the composition
of the literal meanings of the words. In other cases, the meaning
is highly dependent on understanding the context and potential
goals of the person making the utterance. Idiomatic phrases should
be recognized as such, bypassing the normal parsing and semantic
interpretation mechanisms.

Semantics will be represented in terms of a labelled graph or
semantic network. The nodes in a semantic network could be explicit
concepts or compound concepts that are themselves queries against
the system's short- and long-term knowledge base. This virtualizes
concepts. In general, the concepts used in assertions or rules are
virtual, so that knowledge retrieval is founded on a semantic
indexing mechanism.
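
As a sketch of what a virtual concept might look like, the following
defines a concept by a query pattern over a small triple store, so
that its members are computed on demand through the same retrieval
mechanism as everything else. The facts and patterns are
hypothetical.

    # A minimal sketch of a "virtual" concept: rather than an explicit list of
    # members, the concept is defined by a query pattern over the knowledge
    # base, so retrieval always goes through the same semantic indexing. The
    # facts and patterns are hypothetical.

    facts = {
        ("sparrow", "is-a", "bird"),
        ("penguin", "is-a", "bird"),
        ("sparrow", "can", "fly"),
    }

    def query(pattern):
        """Yield bindings of '?'-prefixed variables for each matching triple."""
        for triple in facts:
            bindings = {}
            for p, t in zip(pattern, triple):
                if p.startswith("?"):
                    bindings[p] = t
                elif p != t:
                    break
            else:
                yield bindings

    class VirtualConcept:
        """A concept whose members are computed from a stored query on demand."""
        def __init__(self, pattern, var):
            self.pattern, self.var = pattern, var
        def members(self):
            return {b[self.var] for b in query(self.pattern)}

    birds = VirtualConcept(("?x", "is-a", "bird"), "?x")
    flyers = VirtualConcept(("?x", "can", "fly"), "?x")
    print(birds.members() & flyers.members())   # {'sparrow'}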

A major issue is how the system learns semantics. There could
be a core based upon training against a tagged corpus, but I
believe that some kind of bootstrapping process will be needed.
This could start with very simple concepts, and is likely to need
manual work at a level below the conversational interface.

Major issues for further study

ideas for gathering data for training corpora

design and representation of training corpora

use cases and requirements for the lexicon

choice of grammar formalism

use cases and requirements for semantic representation

design of indexing mechanisms

Reasoning

How does the system react to the semantics of natural language
input and carry out the reasoning needed to generate an appropriate
response? What kinds of mental states are involved? Relevance
theory describes how inference can be minimized to match the
current situation. With the huge amount of information
available, inference is likely to drown in a combinatorial
explosion unless some means is found to contain it and channel
it in a useful direction. The solution seems to involve the use
of relevant contexts. I believe that much thinking is in terms of
constructing and acting out stories, but how to expand on this
idea?
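
One crude way to picture this is sketched below: candidate facts are
scored for relevance against the concepts active in the current
conversation, and only the most relevant ones are considered when
applying inference rules. The scoring function, facts and rule are
placeholders rather than a worked-out design.

    # A minimal sketch of relevance-limited inference: candidate facts are
    # scored against the concepts currently active in the conversation, and
    # only the most relevant ones are considered when applying rules. The
    # scoring function, facts and rule are hypothetical placeholders.

    context = {"john", "money", "bank"}     # concepts active in the conversation

    facts = [
        ("john", "forgot", "bank-visit"),
        ("bank", "provides", "cash"),
        ("penguins", "live-in", "antarctica"),   # true, but irrelevant here
    ]

    rules = [
        # (required predicates, conclusion) - purely illustrative
        (("forgot", "provides"), ("john", "lacks", "cash")),
    ]

    def relevance(triple):
        """Crude relevance score: overlap between a triple's terms and the context."""
        return len(set(triple) & context)

    def infer(max_steps=3, beam=2):
        known = list(facts)
        for _ in range(max_steps):
            # Only the most relevant facts are considered for inference.
            active = sorted(known, key=relevance, reverse=True)[:beam]
            for predicates, conclusion in rules:
                if all(any(f[1] == p for f in active) for p in predicates):
                    if conclusion not in known:
                        known.append(conclusion)
        return known

    print(infer())   # derives ('john', 'lacks', 'cash') and never touches the penguins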

Relevance Theory

Deirdre Wilson and Dan Sperber's relevance
theory introduces relevance as the key to minimizing the
effort needed to understand an utterance. The greater the
relevance, the less the effort that is needed. In a cooperative
dialog, the speaker will be expected to make his or her utterances
as easy as possible for the listener to understand. That means
that the intended meaning of each utterance should be maximally
relevant to the listener.

These are developed in parallel against a background of
expectations which may be revised or elaborated as the utterance
unfolds. In particular, the hearer may bring to the comprehension
process not only a general presumption of relevance, but more
specific expectations about how the utterance will be relevant to
him (what cognitive effects it is likely to achieve), and these
may contribute, via backwards inference, to the identification of
explicatures and implicated premises.

The paper includes worked examples that illustrate the kinds
of inference involved, e.g.

Peter: Did John pay back the money he owed you?

Mary: No. He forgot to go to the bank.

Understanding this exchange involves a chain of inferences: Mary is
probably a friend of John; she prefers to be repaid in cash and not
with a personal cheque; John didn't have enough cash on him to pay
her back; he intended to get more cash from the bank, but forgot to
visit it; as a result he wasn't able to pay Mary; he is likely to
pay her soon after a trip to the bank; and if he forgets again, Mary
will be upset and he will be embarrassed, neither of which he wants
to happen.

Understanding through telling stories

Borrowing money from friends is a common occurrence and this
makes the above example easy to understand. The process of
understanding can be considered as constructing a story that
explains the utterances. This new story can be based upon
remembered stories involving yourself or others, and should be
as simple as possible while still explaining the utterances.

Understanding in a complex world

From relevance theory we have the idea that a speaker will
be expected to make his or her utterances as easy as possible
for the listener to understand. Sometimes the speaker will
fail to do so, and a competent understander will usually be
able to cope with such failings. Social contexts often
influence how things are put. Speakers may use indirect
statements when a direct statement might be seen as impolite
(either by the immediate listener or someone else nearby).

Sometimes people have ulterior motives and may seek to
deceive or to plant an idea that benefits themselves but not
the recipient. One example of this is advertising messages
that aim to make you think that smoking will make you appear
sexy. A sophisticated listener will seek to understand
what benefits the speaker would gain if the listener were to accept
the idea implied by the statement. Is the speaker sincere,
or deceitful? If sincere, is he to be trusted in this matter?

Full-fledged communicative competence involves, for the
speaker, being capable of having at least third-order
meta-representational communicative intentions, and, for
the hearer, being capable of making at least fourth-order
meta-representational attributions of such communicative
intentions. In fact, when irony, reported speech, and other
meta-representational contents are taken into consideration,
it becomes apparent that communicators juggle quite easily
with still more complex meta-representations.

An adequate treatment of common sense needs to cover such
sophisticated use of language, and this has obvious implications
for the way meta-representational intents are expressed as
semantic networks. First order representations are clearly
inadequate!

Episodic reasoning

According to Merriam-Webster, an episode is a usually brief
unit of action in a dramatic or literary work. It is a developed
situation that is integral to but separable from a continuous
narrative. It is an event that is distinctive and separate
although part of a larger series. Episodic reasoning is thus
about how things change in time. We can reason about the causes
of changes associated with events and about the properties that
hold throughout an interval. The frame problem of AI
relates to the challenge of keeping track of what changes and
what doesn't in any given situation when trying to reason about
plans of action. Humans appear to address this problem by
exploiting learned regularities that influence what is most
relevant. We very rarely work things out from first principles.
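
A minimal sketch of this style of reasoning, loosely in the spirit
of the event calculus, is given below: only changes are recorded as
events, and a property is assumed to keep its value between events,
which is one way of sidestepping the frame problem. The events and
properties are invented.

    # A minimal sketch of episodic reasoning, loosely in the spirit of the
    # event calculus: only changes are recorded as events, and a property is
    # assumed to keep its value between events (the frame assumption). The
    # events and properties are hypothetical.

    events = [
        # (time, property, new value)
        (1, ("door", "open"), True),
        (3, ("light", "on"), True),
        (5, ("door", "open"), False),
    ]

    def holds_at(prop, time, default=False):
        """Return the value set by the most recent event at or before 'time'."""
        value = default
        for t, p, v in sorted(events):
            if t > time:
                break
            if p == prop:
                value = v
        return value

    print(holds_at(("door", "open"), 4))   # True - opened at t=1, not yet closed
    print(holds_at(("door", "open"), 6))   # False - closed at t=5
    print(holds_at(("light", "on"), 2))    # False - not switched on until t=3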

Non-verbal reasoning

When you look across a room and decide to take a closer look
out of the window, you use non-verbal reasoning to plan a route
across the room, for instance walking around a dining table in
the center of the room. Common sense reasoning isn't restricted
to verbal reasoning and plays an important role in how you make
sense of what you are looking at in everyday scenes. Understanding
images and movement is made easier because of the statistical
and causal regularities in how things fit together.

Statistical models of syntax and semantics can be applied to
vision in an analogous way to how they apply to spoken and
written language. The syntax describes the visual textures and
optical flow that delineate the image. The semantics describe
the objects making up the scene and how they fit together. Some
examples include the patterns of movement of someone walking,
the effects of perspective on the changing appearance of buildings
as we walk past them, and the possible forms a pair of spectacles
can take when resting on a table. Our effortless understanding of
images relies on access to a vast amount of everyday knowledge
structured at many different levels. Once again relevance theory
is needed to limit processing to what is most relevant in any
situation.

A text-based conversational agent will need some awareness
of non-verbal reasoning, but my hope is that others will be
inspired to work on this once the basic idea of statistical
treatment of semantics has been demonstrated for verbal
reasoning.

Perceptual illusions

Optical illusions reveal the existence of the shortcuts we
use to efficiently process images. The shortcuts work well under
normal circumstances, but break down when the assumptions are
changed. Analogous illusions can be devised for verbal
reasoning. I would like to collect examples of these.

Suggestions for further study

Studying a small number of examples in greater detail:

Recasting inferences from English into semantic networks

Sorting inferences by their relevance and inter-dependencies

Representing the episodic nature of example stories

Identifying common sense knowledge needed for such inferences

Knowledge about stories versus general background knowledge

Hints for how to find relevant stories for a given situation

How stories can be used to add to general background knowledge

Natural language output

The system first needs to construct a semantic representation
of the utterance. This involves modelling the mental states of
the listener and the current state of the conversation. The
utterance has to be broken down into a sequence of pieces that
can then be mapped into natural language.

The next step is to identify natural language templates that
match the utterance and to instantiate them. At this point a
statistical language model can be used to propose words.
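
A minimal sketch of these last two steps is given below, assuming a
handful of hand-written templates and a toy word-frequency table
standing in for a trained statistical language model. The templates,
triples and counts are all hypothetical.

    # A minimal sketch of template-based generation: a triple is matched
    # against hand-written templates, and a toy word-frequency table stands
    # in for a trained statistical language model when choosing between
    # candidate wordings. Templates, triples and counts are hypothetical.

    templates = {
        "owes": ["{subj} owes {obj}", "{subj} is in debt for {obj}"],
        "forgot": ["{subj} forgot {obj}", "{subj} did not remember {obj}"],
    }

    word_freq = {"owes": 50, "is": 300, "in": 250, "debt": 5, "for": 200,
                 "forgot": 40, "did": 100, "not": 150, "remember": 20}

    def score(sentence):
        """Average word frequency, favouring more familiar phrasings."""
        words = sentence.split()
        return sum(word_freq.get(w, 1) for w in words) / len(words)

    def realise(triple):
        """Pick the highest scoring template rendering of a (subj, pred, obj) triple."""
        subj, pred, obj = triple
        candidates = [t.format(subj=subj, obj=obj) for t in templates.get(pred, [])]
        return max(candidates, key=score) if candidates else None

    print(realise(("John", "forgot", "the bank")))   # the more common phrasing wins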