Abstract

To date, most computational models of infant word segmentation have
worked from phonemic or phonetic input, or have used toy datasets. In
this paper, we present an algorithm for word extraction that works
directly from naturalistic acoustic input: infant-directed speech from
the CHILDES corpus. The algorithm identifies recurring acoustic
patterns that are candidates for identification as words or phrases,
and then clusters together the most similar patterns. The recurring
patterns are found in a single pass through the corpus using an
incremental method, where only a small number of utterances are
considered at once. Despite this limitation, we show that the
algorithm is able to extract a number of recurring words, including
some that infants learn earliest, such as "Mommy" and the child's
name. We also introduce a novel information-theoretic evaluation
measure.
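
As a rough illustration of the incremental idea described above, the following toy sketch compares each new utterance only against a small buffer of recent utterances and then groups the matching segments. It is not the algorithm evaluated in this paper: the scalar "frames", the fixed segment length, the plain dynamic time warping distance, and the names dtw_distance, find_recurring, cluster, window, seg_len, and threshold are all assumptions made for the example.

```python
from collections import deque


def dtw_distance(a, b):
    """Length-normalized dynamic time warping distance between two
    sequences of scalar frames (a stand-in for real acoustic features)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def find_recurring(utterances, window=3, seg_len=4, threshold=0.1):
    """Single pass over the corpus. Each utterance is compared only with
    the `window` most recent utterances (hypothetical parameter names)."""
    recent = deque(maxlen=window)  # older utterances fall out automatically
    matches = []
    for idx, utt in enumerate(utterances):
        # Fixed-length segments are a simplification made for this sketch.
        segs = [(idx, s, utt[s:s + seg_len])
                for s in range(len(utt) - seg_len + 1)]
        for i, s, seg in segs:
            for old in recent:
                for j, t, old_seg in old:
                    if dtw_distance(seg, old_seg) <= threshold:
                        matches.append(((j, t), (i, s)))
        recent.append(segs)
    return matches


def cluster(matches):
    """Group matched segments into connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Toy corpus: the same 4-frame pattern occurs in utterances 0 and 2.
pattern = [1.0, 5.0, 2.0, 7.0]
corpus = [
    [0.0, 0.0] + pattern,
    [9.0, 3.0, 9.0, 3.0, 9.0, 3.0],
    pattern + [4.0, 4.0],
]
found = find_recurring(corpus, window=3)
clusters = cluster(found)
```

In this toy corpus, a buffer of three utterances is large enough to pair the two occurrences of the pattern, whereas window=1 misses the pair because the first utterance has already left the buffer. This is the kind of limitation that a small utterance window imposes.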