Reading Development

A recent paper by Seidenberg and McClelland describes a
computational model of word recognition and naming. The authors
claim that it is a developmental model; that is, it explains how
word recognition skills are acquired by children. The purpose of
this paper is to challenge that claim and, in doing so, to set
out a number of criteria against which a model of reading
development can be assessed.

To summarise the criteria:

The environment of learning should reflect that of the
child.

The representations used should be accurate and adequate for
learning.

The model's performance should reproduce observations of
children's performance.

The model should be consistent with wider theories of
cognition.

Some criticisms of the Seidenberg and McClelland model arise
directly from their choice of a connectionist architecture for
implementing the model. We try to identify those parts of the
model that result from this choice and argue that this type of
connectionist model is an inappropriate way of describing
learning.

We claim that this discussion of models of word recognition is
also relevant to other areas of cognitive development.

A computer model of early word recognition has been built that
uses only visual cues to recognise words, broadly following
Seymour's view of reading development. Written words are
represented as a partial ordering on a set of letter instances.
This allows various degrees of positional information to be
recorded, from none at all to a complete ordering of all the
letters in the word. In addition, some letters may not be
identified, their place being marked by a symbol representing,
for instance, any ascender, any letter or any group of letters.
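As an illustration, here is a minimal Python sketch of such a representation (the wildcard inventory, names and matching procedure are invented for illustration, not taken from the model):

    # A hypothetical rendering of the representation described above: a word
    # is a set of letter tokens plus a partial order over them; wildcard
    # tokens stand in for letters that were not fully identified.
    from itertools import permutations

    ANY_LETTER = "*"
    ANY_ASCENDER = "|"       # would match b, d, f, h, k, l, t
    ASCENDERS = set("bdfhklt")

    def token_matches(token, letter):
        if token == ANY_LETTER:
            return True
        if token == ANY_ASCENDER:
            return letter in ASCENDERS
        return token == letter

    # A stored entry for (perhaps) "dog": one letter unidentified, and the
    # only positional information recorded is that "d" precedes "g".
    stored_tokens = ["d", ANY_LETTER, "g"]
    stored_order = {(0, 2)}  # index pairs: token 0 before token 2

    def consistent(word):
        """True if the printed word can be matched to the stored tokens
        without violating any recorded ordering constraint."""
        for pos in permutations(range(len(word)), len(stored_tokens)):
            ok = all(token_matches(t, word[p])
                     for t, p in zip(stored_tokens, pos))
            if ok and all(pos[a] < pos[b] for a, b in stored_order):
                return True
        return False

    print(consistent("dog"))  # True
    print(consistent("god"))  # False: violates "d before g"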

The model is exposed to examples of first-year reading
material collected from local schools. As it `reads', it adds new
words to its lexicon. The performance of the model can be
monitored as the lexicon grows and a corpus of substitution
errors (where an incorrect word is substituted for a target) can
be collected.

This paper describes a number of experiments performed using
the model to explore the development of visual word recognition.
Parts of the recognition procedure are varied and the results are
compared to those observed in young children at various stages of
development.

S. Cassidy, P. Andreae and G. B. Thompson. A
Computational Model of the First Stage of Learning to Read.
Unpublished manuscript.

Current theories of reading development leave unspecified many
of the details of the procedures used by children as they learn
to read. This lack of detail prevents many questions from being
answered about the relationship between various sources of
knowledge and the child's new reading skills. One way of forcing
details to be considered is to build a computational model; this
paper describes such a model of the very first part of a child's
reading development, corresponding perhaps to only a few weeks of
developmental time. One aim of this work is to show how children
apply the skills they already have, before they start to read, to
their very first experiences with print.

The model implements a visual word recognition procedure based
on a lexicon of stored representations accessed via visual cues.
Words are stored initially as an unordered set of letter tokens.
This representation is incomplete in that some letters may be
replaced by markers and others may be omitted altogether. As the
reading procedure develops, the representation becomes more
accurate and the order of letter tokens is also stored. The way
that words are selected from the lexicon changes so that initial
and final letters are used as cues. This developmental pattern
is explored in a series of `snapshot' simulations which model the
procedure under a given set of parametric assumptions. The
simulations are used to predict the characteristics of the
reading procedure, including the types of errors made at each
stage. The error profile of the hypothesised developmental
sequence for visual reading is shown to correspond to published
data from longitudinal studies of children's reading. Finally,
an account of development during this first stage of reading is
presented in terms of Karmiloff-Smith's representational
redescription framework.
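A minimal sketch of this kind of cue-based lexical access (the toy lexicon and cue policy are invented for illustration):

    # Early in development only the initial letter is used as a cue;
    # later the final letter is added, narrowing the candidate set and
    # changing the substitution errors that can occur.
    lexicon = ["cat", "can", "cut", "dog", "dig"]

    def retrieve(printed, use_final=False):
        """Return lexicon entries sharing the printed word's cues."""
        candidates = [w for w in lexicon if w[0] == printed[0]]
        if use_final:
            candidates = [w for w in candidates if w[-1] == printed[-1]]
        return candidates

    print(retrieve("cot"))                  # ['cat', 'can', 'cut']
    print(retrieve("cot", use_final=True))  # ['cat', 'cut']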

Speech

Recent studies in the perception of speech have suggested that
vowel identification depends on dynamic cues, rather than a
single `static' spectral slice at the vowel target. The
experiments reported in this paper seek both to test the extent
to which vowel recognition depends on dynamic information, and to
identify the nature of the dynamic cues on which such recognition
might depend. Both Gaussian classification techniques and
several kinds of neural network architecture were used to
classify around 2000 vowels in /CVd/ citation-form Australian
English words, following training on roughly the same number of
vowel tokens. The first set of experiments shows that when
vowels are classified from three spectral slices taken at the
vowel margins and midpoint, diphthongs, but not
monophthongs, benefit from the additional spectral information at
the vowel margins. The second set of experiments, in which a
time-delay neural network is used, suggests that dynamically
changing acoustic information is beneficial to only a small
number of monophthongs. However, diphthongs are no better
classified by this network than by one in which time is not
explicitly represented, and many monophthongs are classified just
as accurately from a single `static' spectral slice at the
midpoint. The implications of this study are that
not all vowels are dynamic, while those vowels which can be
labelled dynamic are dynamic in different ways.
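The Gaussian side of such a set-up can be sketched as follows (the toy features stand in for the spectral slices; the labels, dimensions and data here are invented, not the study's):

    import numpy as np
    from scipy.stats import multivariate_normal

    def train(features, labels):
        """Fit one Gaussian (mean and covariance) per vowel class."""
        models = {}
        for v in set(labels):
            x = features[np.array(labels) == v]
            models[v] = multivariate_normal(x.mean(axis=0), np.cov(x.T))
        return models

    def classify(models, x):
        """Assign the class with the highest likelihood."""
        return max(models, key=lambda v: models[v].logpdf(x))

    # Toy data: each row is a token; columns might be F1/F2 measured at
    # the onset, midpoint and offset (six numbers per token).
    rng = np.random.default_rng(0)
    feats = np.vstack([rng.normal(300, 30, (50, 6)),
                       rng.normal(600, 30, (50, 6))])
    labels = ["i:"] * 50 + ["a:"] * 50
    models = train(feats, labels)
    print(classify(models, rng.normal(310, 30, 6)))  # expect "i:"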

This study concerns the extent to which place of articulation
in the voiced obstruents /b d dʒ g/ can be distinguished using
spectral parameters taken from the burst, from the formant
transitions, and from a combination of the two. Classifications
were obtained by
training on citation-form data produced by male speakers and
testing on (i) citation-form data produced by female speakers and
(ii) continuous speech data produced by the same male
speakers. The results show that there is more information for the
place distinction in the burst than in formant transitions; when
the parameters are combined into a single model, classification
scores are improved for the citation-form data, but not for the
continuous speech data. The highest classification scores were in
the vicinity of 90% correct for both types of data on the
combined parameters. The results are seen as supporting a model
of sufficient discriminability rather than one in which phonetic
categories are characterised by invariant acoustic cues.

The Emu speech database system enables the annotation of
speech signals at many levels of detail and provides a
mechanism for making links between these levels to produce a
hierarchical annotation. Emu provides facilities for searching
collections of these annotations according to both sequential
and hierarchical criteria. The results of a search can be used
to retrieve acoustic and other data stored along with the
annotations. One perceived shortcoming of the Emu system is its
limited ability to scale to large databases containing many
thousands of utterances. To address this problem we propose a
method of
translating an Emu database into the relational model, as used
by most commercial database systems. Using a Tcl script, the
Emu database is converted into a set of tables for the
relational database. Queries in the Emu query syntax are
translated into SQL and comparisons are made between the query
processing time for Emu and the relational database. The
results show a marked increase in speed for the relational
system on most queries.
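The flavour of the translation can be sketched as follows (the schema, column names and query are illustrative, not the actual Tcl translator's output; the key point is that a sequential constraint becomes a self-join):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE segment
                  (utt TEXT, tier TEXT, pos INTEGER, label TEXT,
                   t0 REAL, t1 REAL)""")
    db.executemany("INSERT INTO segment VALUES (?,?,?,?,?,?)",
                   [("u1", "Phoneme", 0, "k", 0.00, 0.05),
                    ("u1", "Phoneme", 1, "a:", 0.05, 0.20),
                    ("u1", "Phoneme", 2, "t", 0.20, 0.26)])

    # "a: followed by t" expressed as a join on adjacent positions.
    hits = db.execute("""
        SELECT a.utt, a.label, b.label, a.t0, b.t1
        FROM segment a JOIN segment b
          ON a.utt = b.utt AND a.tier = b.tier AND b.pos = a.pos + 1
        WHERE a.label = 'a:' AND b.label = 't'
    """).fetchall()
    print(hits)  # [('u1', 'a:', 't', 0.05, 0.26)]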

Researchers in various fields, from acoustic phonetics to
child language development, rely on digitised collections of
spoken language data as raw material for research. Access to
this data has, in the past, been provided in an ad-hoc manner
with labelling standards and software tools developed to serve
only one or two projects. A few attempts have been made at
providing generalised access to speech corpora but none of
these has gained widespread popularity. The Emu system,
described here, is a general purpose speech database management
system which supports complex multi-level annotations. Emu can
read a number of popular label and data file formats and
supports overlaying additional annotation with inter-token
relations on existing time-aligned label files. Emu provides a
graphical labelling tool which can be extended to provide
special purpose displays. The software is easily extended via
the Tcl/Tk scripting language which can be used, for example,
to manipulate annotations and build graphical tools for
database creation. This paper discusses the design of the Emu
system, giving a detailed description of the annotation
structures that it supports. It is argued that these
structures are sufficiently general to potentially allow Emu to
read any time-aligned linguistic annotation.

Annotated speech corpora are databases consisting of signal
data along with time-aligned symbolic `transcriptions'. Such
databases are typically multidimensional, heterogeneous and
dynamic. These properties present a number of tough challenges
for representation and query. The temporal nature of the data
adds an additional layer of complexity. This paper introduces
annotated speech databases and then presents and harmonises two
independent efforts to model them: one at the University of
Pennsylvania and one at Macquarie University. A
range of actual and possible query languages are described,
along with illustrative applications to a variety of analytical
problems. The research reported here forms a part of various
ongoing projects to develop platform-independent open-source
tools for creating, browsing, searching, querying and
transforming linguistic databases, and to disseminate large
linguistic databases over the internet.
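A toy rendering of the directed-graph view (the encoding below is one possible reading of the published description, invented here for illustration):

    # Nodes carry times; arcs carry a type and a label. Hierarchy is
    # implicit: an arc dominates the arcs whose spans it contains.
    nodes = {0: 0.00, 1: 0.05, 2: 0.20, 3: 0.26}  # node id -> seconds
    arcs = [
        (0, 3, "Word", "cat"),     # one word arc spans the whole token
        (0, 1, "Phoneme", "k"),    # phoneme arcs partition the same span
        (1, 2, "Phoneme", "a:"),
        (2, 3, "Phoneme", "t"),
    ]

    def children(parent, arcs):
        """Arcs temporally contained within the parent arc's span."""
        s, e = nodes[parent[0]], nodes[parent[1]]
        return [a for a in arcs if a is not parent
                and nodes[a[0]] >= s and nodes[a[1]] <= e]

    print(children(arcs[0], arcs))  # the three phoneme arcs of "cat"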

Large annotated collections of speech data are now common in
spoken language research and a recent focus has been on the
development of annotation standards and query languages for
these annotations. As part of this process it is important to
evaluate the emerging proposals against a range of linguistic
annotation practices and in many different domains.

This paper presents an example of a richly annotated
discourse segment which includes both DAMSL-style discourse-level
annotation and ToBI intonational analysis. We describe
how this annotation could be realised in each of the Emu, MATE
and Annotation Graph formalisms.

In order to evaluate the different query languages we take a
small number of queries and attempt to express them in each
query language. We are particularly interested in the
naturalness of the query expression in each case. In some
cases we find that a query cannot be expressed at all in the
current form of a language. We make a number of suggestions to
guide the
development of these query languages.

Recent work has shown that a single data model can represent
many different kinds of linguistic annotation. This data model
can be expressed equivalently either as a directed graph of
temporal nodes (Bird and Liberman, Speech Communication, 2000)
or as a set of intersecting hierarchies (Cassidy and Harrington,
Speech Communication, 2000). While some tools are being built to
support this data model, there is as yet no query language that
can be used to search annotations stored in this way. Since
the hierarchical view of annotations has much in common with
the XML data model, this paper examines a recent proposal for
an XML query language as a candidate annotation query
language. The methodology used is a use case analysis. The
result of the analysis shows that XQuery provides many useful
features, particularly when queries include hierarchical
constraints, but that it is weak at expressing sequential
constraints.
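To make the contrast concrete, here is a small illustration using the XPath subset in Python's standard library rather than XQuery itself (no XQuery engine ships with Python; the XML encoding of the annotation is hypothetical):

    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <utterance>
      <syllable stress="S"><phoneme>k</phoneme><phoneme>a:</phoneme></syllable>
      <syllable stress="W"><phoneme>t</phoneme><phoneme>@</phoneme></syllable>
    </utterance>""")

    # Hierarchical constraint: phonemes dominated by a strong syllable.
    # This is a single path expression.
    print([p.text for p in doc.findall(".//syllable[@stress='S']/phoneme")])

    # Sequential constraint: "the phoneme preceding a t" needs manual
    # positional bookkeeping, mirroring the weakness noted above.
    ph = doc.findall(".//phoneme")
    print([a.text for a, b in zip(ph, ph[1:]) if b.text == "t"])  # ['a:']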

Backchannels are short interruptions by a second talker within
a dialogue which typically signify agreement with the main
talker or a desire to interrupt. While these short utterances
do not contain much
useful semantic content in themselves, they do provide a key to
understanding the structure of a dialogue. We describe
experiments aimed at automatically segmenting a multi-party
teleconference dialogue into speaker turns with particular
emphasis on the detection of short (less than 500ms)
utterances. Using the Bayesian Information Criterion (BIC) to
detect acoustic changes, we are able to segment an input signal
and find around 80% of all turn boundaries and around 40% of all
segments with a duration of less than 500ms.
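The BIC test itself is compact enough to sketch (this is the standard delta-BIC formulation; the window sizes, features and penalty weight below are illustrative rather than the settings used here):

    import numpy as np

    def delta_bic(X, t, lam=1.0):
        """Positive values favour a speaker change at frame t: two
        Gaussians fitted to the halves beat one Gaussian on the whole
        window by more than the BIC model-size penalty."""
        n, d = X.shape
        logdet = lambda Y: np.linalg.slogdet(np.cov(Y.T))[1]
        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n) * lam
        return (0.5 * n * logdet(X)
                - 0.5 * t * logdet(X[:t])
                - 0.5 * (n - t) * logdet(X[t:])
                - penalty)

    # Toy signal: 100 frames of one 'talker', then 100 of another.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (100, 12)),
                   rng.normal(3, 1, (100, 12))])
    scores = [delta_bic(X, t) for t in range(20, 180)]
    print(20 + int(np.argmax(scores)))  # close to the true change at 100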

The longer-term goal of this project is to
be able to track speakers throughout a multi-party meeting in a
normal meeting room environment. This study reports on initial
results in detecting acoustic change in this environment using an
array of four microphones. We contrast these results with a
single-microphone condition. The use of multiple microphones is
expected to aid acoustic change detection through the use of
spatial information and an improved signal-to-noise ratio. We
recorded six people in a 30-minute meeting with four cardioid
microphones arranged at the corners of a square. Acoustic change
hypotheses were generated separately for each channel using the
Bayesian Information Criterion [1]. We found that combining the
acoustic change hypotheses from the different channels resulted
in a superior overall segmentation of the signal compared with a
single-microphone condition.
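One simple fusion scheme in this spirit (a sketch only; the combination method actually used may differ) pools the per-channel change times and keeps those corroborated by more than one microphone:

    def fuse(channel_hyps, tol=0.25, min_support=2):
        """channel_hyps: one list of change times (seconds) per mic.
        Nearby times are clustered; clusters seen on fewer than
        min_support channels are discarded as spurious."""
        pooled = sorted(t for ch in channel_hyps for t in ch)
        fused, cluster = [], [pooled[0]]
        for t in pooled[1:] + [float("inf")]:
            if t - cluster[-1] <= tol:
                cluster.append(t)
            else:
                if len(cluster) >= min_support:
                    fused.append(sum(cluster) / len(cluster))
                cluster = [t]
        return fused

    hyps = [[1.02, 5.30, 9.8], [0.98, 5.35], [5.28, 7.1], [1.05, 5.31]]
    print(fuse(hyps))  # changes near 1.0s and 5.3s survive; 7.1 and 9.8 do not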

In this paper, we demonstrate the advantages of combining two
largely complementary systems: Praat, a computational system for
doing phonetics, and the EMU system for speech database
analysis. The interface operates on annotations: a Praat
TextGrid is converted into an EMU hierarchical annotation
structure and vice versa. With the
exception of annotations in EMU that are not explicitly linked to
times, we show that there is no loss of information in this
conversion. The interface between the Praat and EMU systems
provides a flexible labelling system: the data can be labelled as
segments or events in Praat and various kinds of structures
between annotation tiers can be defined and then queried within
EMU. We argue that both the variety of existing speech databases
and the multitude of different possible types of speech
analysis require a modular approach allowing the integration of a
number of different stand-alone components that are adapted to
different aspects of creating, annotating, querying and analysing
speech data.
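As a flavour of the TextGrid side of such an interface, a much-simplified sketch (the field names follow Praat's long TextGrid format, but the parsing is illustrative and the EMU side is omitted entirely):

    import re

    def read_intervals(textgrid_text):
        """Extract (xmin, xmax, text) triples, ignoring tier structure."""
        pat = re.compile(r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "(.*?)"')
        return [(float(a), float(b), s)
                for a, b, s in pat.findall(textgrid_text)]

    sample = '''
            intervals [1]:
                xmin = 0.00
                xmax = 0.26
                text = "cat"
    '''
    print(read_intervals(sample))  # [(0.0, 0.26, 'cat')]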