The Modulation Theory of Speech

As we have seen (background, in Swedish),
speech signals contain linguistic, expressive, organic and perspectival
information. Listeners are capable of distinguishing these different
types of information from each other, but the acoustic properties
usually measured by phoneticians are affected by several of these
factors. This has not been treated adequately in previous theoretical
reasoning and in theories of speech perception. The Modulation Theory of
Speech is intended to let us see how these different kinds of
information can be separated again, based on an analysis of how they are
fused when speech is produced.

Within the framework of the Modulation Theory, the human faculty of
communicating by means of speech is seen as a biological innovation
founded on a faculty of expressive communication that is much older
and that still plays an important part in human communication.

In accordance with this, speech signals are regarded as the result of a
process in which a carrier signal, whose properties are given by organic
and expressive factors, has been modulated with conventional linguistic
speech gestures.

A linguistically neutral carrier signal can be thought of as a
'colorless' vowel, a primitive human vocalization that occurs, e.g., as a
hesitation sound. Its properties are given by the size of the speaker's
organ of speech (vocal fold mass and length, vocal tract length, etc.)
and by its paralinguistic "settings".

The acoustic properties of speech
signals deviate from those of a neutral carrier signal in a way that is
specific to each speech sound.

Thus, the linguistic phonetic quality is associated with these
deviations and not immediately with the absolute properties of the
speech signal.

For the perception of the different types of information in speech, this
implies that demodulation is necessary in order to separate them.

The listener has to discover how the carrier signal has been modulated
in order to be able to recognize the conventional linguistic
information. On the other hand, the modulation must not affect his
judgment of the organic and expressive qualities, which are reflected in
the carrier signal. Thus, the listener has to separate the modulation
from the carrier signal and judge each on its own.

When an infant says its first word, it demonstrates that it has
acquired at least a rudimentary control over the processes which are
described by the Modulation Theory of Speech. When a child imitates
something an older person has said, which is what happens here, it must
first have recognized how the older person has modulated his carrier,
and thereafter it must have modulated its own carrier in the same way.
The imitation of any bodily posture or gesture follows an analogous
procedure. There is a carrier (body, hand, face, vocal tract) that
provides a system of reference and standards of comparison used in
transposing the posture or gesture into a different system of reference
with different standards of comparison.

In describing speech perception, it is important to measure each
type of deviation with the right kind of ruler. On these rulers, equal
intervals have to be equivalent for the listener. Thus, it would be
wrong to measure pitch and its deviations from its base value in Hz,
which is the physical unit of frequency. Pitch is more correctly
represented in semitones or some other measure that is proportional to
the logarithm of frequency. For formant frequencies, a tonotopic (Bark)
scale appears to be the correct choice, but certain power functions of
frequency can also be used. For intensity differences, a dB scale
appears to be close to ideal.
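
The three rulers mentioned above can be sketched in a few lines of
Python. This is only an illustration: the function names and the
reference values are assumptions made here, and the Bark formula is one
published approximation of the tonotopic scale (Traunmüller, 1990),
chosen for convenience among several that exist.

```python
import math

def hz_to_semitones(f_hz, ref_hz):
    """Interval between f_hz and a reference frequency, in semitones."""
    return 12.0 * math.log2(f_hz / ref_hz)

def hz_to_bark(f_hz):
    """Approximate tonotopic (Bark) value of a frequency in Hz.

    Assumption: the approximation z = 26.81 f / (1960 + f) - 0.53.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def amplitude_ratio_to_db(amplitude, ref_amplitude):
    """Level difference in dB between two amplitudes."""
    return 20.0 * math.log10(amplitude / ref_amplitude)

# The same 100 Hz step is a different interval depending on where it starts.
step_from_100 = hz_to_semitones(200.0, 100.0)  # a full octave
step_from_200 = hz_to_semitones(300.0, 200.0)  # a fifth, clearly smaller
```

Measured in Hz the two steps look identical, whereas in semitones the
second is little more than half the size of the first, which is closer
to how the difference is heard.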

In order to recognize the linguistic quality of speech sounds,
listeners can be said to evaluate the deviations of the instantaneous
properties of the speech signal (F0, formant frequencies, etc.) from those they expect
of a linguistically neutral sound with the same organic and expressive
quality. In this process, the expectations of listeners are governed by
extrinsic properties known from previous experience, e.g., when they
know the speaker, or when they have heard him speak for a while, and by
such intrinsic properties as the frequency positions of the higher
formants (F3 and above), which are not affected as much as F1 and F2 by
a variation in linguistic quality. As we have seen on the previous page,
F0 plays an important part in this connection. Listeners appear to
obtain (unconsciously) an estimate of its base value by analyzing what
the F0 curve looked like in the recent past.
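
As a rough sketch of this idea, the deviation of the current F0 from a
base value estimated over the recent past could be computed as follows.
The text does not say how listeners arrive at the base value; taking a
low quantile of the recent voiced F0 values is purely an assumption made
for illustration, as are the function names and the example data.

```python
import math

def estimate_base_f0(recent_f0_hz, fraction=0.1):
    """Hypothetical estimator: a low quantile of recent voiced F0 values.

    Frames with a value of 0 are treated as unvoiced and ignored.
    """
    voiced = sorted(f for f in recent_f0_hz if f > 0)
    if not voiced:
        raise ValueError("no voiced frames in the recent past")
    index = int(fraction * (len(voiced) - 1))
    return voiced[index]

def f0_deviation_semitones(f0_hz, base_hz):
    """Deviation of the current F0 from the base value, in semitones."""
    return 12.0 * math.log2(f0_hz / base_hz)

# Made-up F0 track in Hz (0.0 marks unvoiced frames).
recent = [0.0, 110.0, 118.0, 131.0, 125.0, 0.0, 104.0, 100.0, 112.0]
base = estimate_base_f0(recent)
deviation = f0_deviation_semitones(150.0, base)
```

The deviation is expressed in semitones rather than Hz so that equal
steps remain comparable across speakers with different base values.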

Listeners evaluate the instantaneous positions of the spectral peaks shaped by the formants in relation to each other and to the base value of F0.
Experiments have shown that listeners do this above all with spectral
peaks that are fairly close to each other. In this way, it is often
possible to discover the linguistic information encoded in the formant
frequencies without depending on prior recognition of the organic and
expressive quality. When the acoustic signal is deficient in
information, e.g., in whispering, when F0 is missing, or in the presence of any disturbing noise, listeners have to rely more on their expectations.
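
The relational evaluation described in this paragraph can be
illustrated by expressing F0 and the formants on a tonotopic scale and
taking the distances between neighboring peaks. This is a sketch under
stated assumptions, not a model of the listener: the Bark approximation
(Traunmüller, 1990), the function names, and the example frequencies are
all choices made here.

```python
def hz_to_bark(f_hz):
    """Approximate tonotopic (Bark) value; one approximation among several."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def tonotopic_distances(f0_hz, formants_hz):
    """Distances in Bark between neighboring peaks: F1-F0, F2-F1, F3-F2, ..."""
    peaks = [hz_to_bark(f0_hz)] + [hz_to_bark(f) for f in formants_hz]
    return [upper - lower for lower, upper in zip(peaks, peaks[1:])]

# Made-up values for one vowel token: F0 and the first three formants in Hz.
distances = tonotopic_distances(120.0, [500.0, 1500.0, 2500.0])
```

Each entry of `distances` describes one peak relative to its lower
neighbor, so the description does not depend on the absolute frequency
values alone.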

In the presence of disturbing noise, it becomes very clear that
the recognition of the linguistic quality of speech signals is also
driven in other ways by expectations, and not only by the speech signal.
Listeners are capable of hearing even that which cannot be heard
objectively. This phenomenon is known as "perceptual restoration". In
the process of listening, listeners continuously test how compatible the
properties of the signal are with different alternative
interpretations, and a speech signal remains compatible with an
interpretation even if it is partially masked by a disturbing noise.
Phenomena of this kind are illustrated on the next page (in Swedish).

Hartmut Traunmüller (1994) "Conventional, biological, and environmental factors in speech communication: A modulation theory" Phonetica 51: 170-183. (Also in PERILUS XVIII: 92-102.)
Note: The terms "expressive" and "organic" (quality,
information, properties) are much more adequate and should be
substituted for "affective" and "personal" as used in that paper.