Pitch and timbre are terms frequently used in studies
on sound perception. Despite the existence of formal definitions, these terms
are often used ambiguously in the literature. This paper is intended as a
review of the ANSI definitions and their shortcomings, of modern ways to define
the concepts operationally, and of the various dependencies of pitch and timbre
on physical attributes of sound. Finally, their independent functioning in
speech, their mutually dependent functioning in music,
and their mediating role in object recognition will be discussed.

INTRODUCTION

The terms pitch and
timbre refer to subjective,
perceptual attributes of sound that play an important
role in the perception of speech and music. Because the attributes referred
to are subjective, they can be examined only by psychophysical
methods and cannot be measured by direct physical means.

A first requirement for meaningful use
of terms like pitch and timbre is a definition. Formal definitions do exist
(American National Standards Institute 1973), but appear not to be very useful.
In contemporary literature they are being replaced by definitions that are less
formal and more operational. The shortcomings of the ANSI definitions and
current ideas of what the terms should mean are discussed in the first section
of this paper.

A second question is the dependence of
each of these subjective attributes on physical attributes of sound such as
frequency, intensity, spectral shape or temporal envelope. This will be
reviewed in the second section of the paper.

A third issue is the interrelation
between the two subjective attributes in practice. In speech, pitch contours of
spoken sentences are the principal carriers of prosodic information (if we
disregard the so-called tone languages where pitch contours also convey
semantic information). Timbres or timbre transitions, on the other hand, enable
us to identify phonemes or phoneme clusters, resulting in understanding of what
is being said. Pitch and timbre patterns each have a specific function and
appear to act quite independently of one another. In music, however, pitch and
timbre have a mutual relationship and dependence imposed by requirements of
consonance when sounds occur together. This is discussed in the third section
of the paper. The last section deals
with the cognitive issue of object identification. In speech, vowels and consonants are recognized
on the basis of their formant structure, i.e., their spectral shape. In
subjective terms this implies the use of timbre-like cues. In music, the sound
of particular musical instruments or instrument combinations is recognized on
the basis of perceived timbre. A question is whether timbre recognition is
synonymous with the recognition of a sound source, i.e.,
a particular musical instrument, or whether timbre represents a separate perceptual
space which mediates in the recognition of musical objects.

DEFINITIONS

The American National Standards Institute (1973)
defines loudness as "...that intensive attribute of auditory sensation in
terms of which sounds may be ordered on a scale extending from soft to
loud", pitch as "...that attribute of auditory sensation in terms of which
sounds may be ordered on a scale extending from high to low", and timbre as
"...that attribute of auditory sensation in terms of which a listener can
judge that two sounds, similarly presented and having the same loudness and
pitch, are different". Timbre is therefore defined
in a purely negative manner as
"everything that is not loudness,
pitch, or spatial perception".

A review of modern psychoacoustical
literature reveals that the ANSI definitions
have not been particularly useful. The main reason is that they do not really provide a clear distinction between the
three subjective attributes of sound. For instance, the difference between the
definitions of loudness and pitch rests almost
entirely on the semantic difference between the endpoints of the scales soft-loud
vs. low-high. This has led to some interesting confusions in the literature. Tanner and Rivette
(1964) reported that speakers of Punjabi, one of the many Indian languages, had unusually large
difference limens for frequency. They had asked their subjects, following
common practice, to discriminate between two tones by telling which of the two was the higher one.

Burns and Sampat (1980), who repeated the experiment,
found that frequency difference limens became perfectly normal if the
instructions to the subjects accounted for the fact
that in the Punjabi language the same
word is used to indicate that a sound is "high in pitch" or "loud".
It has also been established that ratings along a scale that ranges from dull
to sharp account for most of the variance of timbre ratings of complex
sounds (von Bismarck 1974). The
difference between a low-high and a dull-sharp scale appears to be semantically
quite small and easy to confuse. Therefore, many timbre effects and phenomena may have been
wrongly identified as pitch effects in the literature.

The pitch of periodically interrupted noise (Miller
& Taylor 1948), the (missing fundamental) pitch of temporally
successive harmonics (Hall & Peters 1981), and binaural edge pitch (Klein
& Hartmann 1981) were all identified and measured by means of discrimination
or matching experiments, where only ordinal properties of the sensation play a
role. The reported pitch effects could therefore,
entirely consistent with the definitions and the experimental evidence, very
well have been timbre effects.

Looking at the everyday use of the word
pitch from a musical viewpoint, one observes that it actually entails much more than
the ANSI definition gives credit for. Although pitch space is a continuum, it
is, at least in Western music, treated as a collection of steps, where pitch
intervals correspond to certain well-defined frequency ratios. The fact that we
not only can tell that one sound is higher than another, but also can compare
and identify the magnitude of pitch steps, is hardly accounted for in the ANSI
definition. Modern pitch studies therefore often use experimental paradigms
that involve musical interval or melody identification. Their explicit or
underlying operational definition is that pitch is the subjective correlate of
each one of the acoustical events in a musically meaningful sequence of tones (Houtsma & Goldstein 1972; Roederer
1979).

Although pitch is a single perceptual
attribute, it is as such not necessarily one-dimensional.
Music psychologists in the past, for instance, have pointed out that in one sense
two notes that are a semitone apart are closer than two notes that are
separated by an octave, but that in another sense the octave notes are much
more closely related, and more easily confused, than notes a semitone apart. Think of the typical octave
errors often made in absolute pitch judgements. This has led to rather complex
pitch-space representations in which the chromatic tone scale, the circle of
fifths, octave circularity and other properties are all accounted for (Shepard
1982). Three such representations are shown in Fig. 1, in increasing
order of complexity.
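This two-component view of pitch can be made concrete with a small numerical sketch (the function names and the 440 Hz reference are illustrative assumptions, not from the paper): pitch height grows monotonically with log-frequency, while chroma wraps around once per octave, so notes an octave apart share a chroma value even though their heights differ by twelve semitones.

```python
import math

def height(f, ref=440.0):
    # monotone "tone height" dimension: semitones above the reference
    return 12 * math.log2(f / ref)

def chroma(f, ref=440.0):
    # circular "chroma" dimension: position within the octave (0-12);
    # notes an octave apart map onto the same point, which is why
    # octave confusions in absolute pitch judgements are so common
    return height(f, ref) % 12

print(height(880.0) - height(440.0))   # 12.0 -- an octave of tone height
print(chroma(880.0) == chroma(440.0))  # True -- identical chroma
```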

The word timbre is used almost
exclusively in the psychoacoustic literature with respect to music, and is
hardly found in the speech literature. In speech the perceptual entity is a
phoneme (vowel or consonant), and not some arbitrary point in timbre space. In
music-related studies timbre has always been treated as a multidimensional
continuum in which any point is potentially meaningful. It has been established
by rating and multidimensional scaling techniques that the space can be
adequately described in four subjective dimensions (dull-sharp, compact-scattered, colorful-colorless
and full-empty) which are linked to physical dimensions such as spectral energy
distribution, amount of high-frequency energy in the attack, and synchronicity of high-harmonic transients (von Bismarck 1974;
Grey 1977). A modern development, somewhat analogous to what is observed in
speech research, is to consider timbre in close connection with object
identification, both for musical and natural, environmental sounds (Handel
1995).

Fig.1

PHYSICAL DEPENDENCIES

Pitch

The perceived pitch of a sound depends most of all on
its frequency. For a pure tone this is rather unambiguous since there is only
one frequency. For a complex sound, where several frequencies are involved,
pitch salience depends on the degree to which partials are harmonic. For a
harmonic sound the pitch depends on the frequency of the fundamental, no matter
whether it is physically present or not (for a review, see Houtsma
1995). For inharmonic complex tones such as church or
carillon bells, the pitch depends on the frequencies of certain partials and is
usually less salient than for harmonic tones.
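The missing-fundamental effect can be illustrated with a short, hedged sketch (the sample rate, harmonic numbers, and the autocorrelation estimator are illustrative choices, not the methods reviewed here): a complex built from harmonics 3-8 of 200 Hz contains no spectral energy at 200 Hz, yet its periodicity, recovered below by autocorrelation, corresponds exactly to the absent fundamental.

```python
import math

SR = 8000   # sample rate in Hz (illustrative choice)
F0 = 200.0  # fundamental frequency that will be physically absent

# Harmonic complex built from harmonics 3-8 only (600-1600 Hz):
# the spectrum contains no component at 200 Hz.
n = int(0.1 * SR)
signal = [sum(math.sin(2 * math.pi * h * F0 * t / SR) for h in range(3, 9))
          for t in range(n)]

def estimate_f0(x, sr, fmin=100.0, fmax=500.0):
    """Crude autocorrelation pitch estimator: the lag of maximal
    self-similarity gives the period of the (virtual) pitch."""
    best_lag, best_r = 0, -float("inf")
    for lag in range(int(sr / fmax), int(sr / fmin) + 1):
        r = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sr / best_lag

print(estimate_f0(signal, SR))  # 200.0 -- the missing fundamental
```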

The pitch of a pure tone can also be influenced by
changing its intensity (Terhardt 1974), duration
(Doughty & Garner 1948), attack/decay envelope (Hartmann 1978), the amount
of (partial) masking noise (Terhardt & Fastl 1971), and the ear to which the tone is presented
(van den Brink 1970). From the viewpoint of music practice these effects
would seem to wreak havoc in a performance, because tones would constantly
have to be compensated for them. Fortunately, the mentioned pitch dependency
effects are mostly absent when complex tones are used, as is the case in most
music.

Timbre

The timbre of a sound, being itself a multidimensional
attribute, depends on several physical variables. There is, in the first place,
the frequency content and the spectral profile of the sound. Because the human
ear has limited frequency resolution power, the spectral composition vector can
often be reduced to a vector representing the amount of instantaneous acoustic
power in each critical band (Plomp 1970) without much
perceptual loss of information. This reduction is limited, however, by the fact that phase
relations between spectrally unresolved harmonics have an audible influence on
timbre (Goldstein 1967). Finally, the temporal envelope of an instrumental
sound, including attack, decay and modulation of the steady-state portion,
influences the perceived timbre to such an extent that changes on any of them
can make the sound of an instrument unrecognizable (Berger 1964).
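The critical-band reduction mentioned above can be sketched as follows (a hedged illustration: the partials are made up, and the analytical Bark-scale approximation after Zwicker & Terhardt is one common choice, not necessarily Plomp's formulation). Each partial's power is simply summed into the critical band its frequency falls in, discarding finer spectral detail and all phase information.

```python
import math

def bark(f):
    # common analytical approximation of the critical-band (Bark) scale
    # (after Zwicker & Terhardt); input frequency in Hz
    return 13 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def critical_band_profile(partials):
    """Reduce (frequency_Hz, power) partials to summed power per
    critical band -- the perceptually near-sufficient description
    discussed in the text."""
    bands = {}
    for freq, power in partials:
        b = int(bark(freq))
        bands[b] = bands.get(b, 0.0) + power
    return bands

# Widely spaced low partials stay in separate bands; closely spaced
# high partials fall into the same critical band and merge.
print(critical_band_profile([(200, 1.0), (400, 1.0), (3000, 1.0), (3100, 1.0)]))
```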

Empirical distinction

Looking at all dependencies listed above, one
concludes that the sound attributes pitch and timbre have not only been defined
in a rather ambiguous manner, but also depend to some extent on the same
physical variables. It is therefore not surprising that there is confusion in
the literature about pitch and timbre, and that effects sometimes have been
misnamed.

As an illustration
of how one could possibly distinguish the two attributes empirically, an
experiment will be reviewed which I reported more than a decade ago (Houtsma 1984). Seven different types of sound were selected, all having been reported
in the literature to evoke some kind of pitch sensation. With each of the sound
types, random 4-note sequences were played by sampling notes of a diatonic
scale. Subjects were asked to play the perceived sequence (`melody') back on an
8-note keyboard. All, of course, were familiar with keyboard playing and were
able to play tunes `by ear' without sheet music. The data were analyzed by
computing a correlation coefficient between presented and played-back
sequences. Such a coefficient is unity if all sequences are played back correctly.
The coefficient remains high as long as the order (up/down movement) between
presented and perceived sequences is preserved. This can be seen in Fig. 2a,
where correlation scores are shown for each subject
and for each of the seven different types of sound.

Fig. 2.

The second type of analysis, shown in
Fig. 2b, was an actual count of the notes that were correctly played back and
therefore must have been perceived correctly. A high score in this manner of
counting requires not only ordinal (low vs. high), but also interval (step
size) perception. By comparing both types of analysis one can distinguish those
sensations that have only ordinal properties from those that have interval or
ratio properties. It was found that some sounds yielded high scores with both
types of analysis (stimuli 1, 2, and 3, indicating real pitch effects), some
yielded high correlation scores and poor identification scores (stimuli 4, 5
and 6, suggesting timbre effects), and one yielded low scores on both counts
(stimulus 7, suggesting neither pitch nor timbre sensations).
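The two analyses can be sketched numerically (the note sequences below are hypothetical, not data from the experiment): a response that preserves the melodic contour but distorts the interval sizes still correlates highly with the presented sequence, while the note-identity count exposes the missing interval information.

```python
presented  = [60, 64, 62, 67]   # presented 4-note sequence (MIDI-style numbers)
contour_ok = [61, 66, 63, 70]   # contour preserved, intervals wrong
exact      = [60, 64, 62, 67]   # every note reproduced correctly

def correlation(x, y):
    # Pearson correlation between presented and played-back sequences:
    # stays high whenever the up/down (ordinal) pattern is preserved
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def note_score(x, y):
    # fraction of notes played back exactly: requires interval perception
    return sum(a == b for a, b in zip(x, y)) / len(x)

print(correlation(presented, contour_ok), note_score(presented, contour_ok))
print(correlation(presented, exact), note_score(presented, exact))
```

The contour-preserving response scores nearly as well as the exact one on correlation, but zero on the note count, which is precisely the pattern interpreted in the text as a timbre rather than a pitch effect.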

RELATIONSHIPS

Although the attributes pitch and timbre have been
defined as separate concepts, they are at least partially tied to the same
physical attributes of sound. One may wonder, therefore, to what extent the subjective
attributes themselves are dependent or independent of one another. It appears
that this is different for speech and for music.

In speech, a pitch contour is a rather continuous
melody-like pattern that is evoked by vowels and voiced consonants, and is
directly related to the vibration rate of the vocal cords. It carries prosodic
information, and control of intonation patterns is subject to rather strict,
language-dependent rules ('t Hart, Collier & Cohen 1990). The timbre of a
speech sound, although not commonly named this way in the speech literature, is
different for each phoneme and depends physically on the shape of the glottal
air flow pulse and the instantaneous shape and length of the vocal tract
(throat, oral and nasal cavities). It seems that the two work quite
independently of one another. The same sentence can be spoken with different
intonation patterns, or even without any intonation (whispered), and still
sound perfectly intelligible and natural. If natural speech is artificially
manipulated, however, some bounds to this independence are found, as is
illustrated by an example presented in the next section.

In music (including singing) there is the extra
constraint, not found in speech, that sounds usually
occur simultaneously and are intended to create well controlled sensations of
consonance and dissonance. This imposes tight constraints on the choices of
pitch steps (frequency ratios of scales) and timbres (frequency contents of
instruments used). Traditional tone scales, from the Pythagorean tuning to the
modern equally tempered tone system, were developed to minimize dissonant beat
sensations and maximize musical flexibility (ability to play in different
keys). The implicit, underlying assumption always seems to be that the musical
sounds to be used are harmonic sounds. This is true for the human voice, wind
instruments and bowed string instruments, but only approximately true for freely
vibrating stringed instruments (piano, guitar) and tonal percussion instruments
(xylophone, carillon bells). Today it is easy to create computer sounds with
any desired degree of inharmonicity. It is important
from a musical viewpoint, and possible from a computational viewpoint, to
create the most appropriate tone scale, given any complex tone structure (Sethares 1993).
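The compromise behind equal temperament can be made concrete with a small sketch (the reference pitch and interval choices are illustrative): in the twelve-tone equally tempered system every semitone has the frequency ratio 2^(1/12), so the fifth (seven semitones) only approximates the beat-free just ratio 3/2.

```python
SEMITONE = 2 ** (1 / 12)  # equal-tempered semitone frequency ratio

def et_frequency(base_hz, semitones):
    # frequency a given number of equal-tempered semitones above base_hz
    return base_hz * SEMITONE ** semitones

a4 = 440.0
fifth_et = et_frequency(a4, 7) / a4   # ~1.49831
fifth_just = 3 / 2                    # beat-free just fifth

# the small mismatch is the price paid for being able to play in all keys
print(fifth_et, fifth_just - fifth_et)
```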

OBJECT RECOGNITION

The subject of timbre perception has in some recent
publications become closely tied to the topic of auditory object recognition
(Handel 1995). This is a development quite analogous to what already has
occurred in the speech perception tradition. A somewhat simplistic theory would
be that each musical instrument has its own unique timbre over its entire
playing range. It is easy to see that such a theory leads to serious problems.
No two bassoons sound exactly alike. Nevertheless, we are able to recognize
each bassoon sound as such. This problem is the same as the invariance problem
found in the speech literature (Stevens & Blumstein 1978). It is also well
documented that spectrotemporal profiles vary quite
drastically from one note to the other when going through the playing range of
a musical instrument (Brown 1991). Conversely, if one synthesizes a tone scale
of an instrument by giving all notes the same relative spectrotemporal
profile, the instrument sounds very unnatural and is unrecognizable as such (Houtsma, Rossing & Wagenaars 1987). Therefore, if a particular constant timbre
were associated with the sounds of a musical instrument, the relationship
between timbre and physical sound attributes would become very loose or
nonexistent. This would be very unappealing. Such a constant-timbre theory
would also be unable to account for the fact that we can often hear the
difference between two instruments of the same kind.

It is much more likely that object recognition occurs
in steps on two different levels, as is illustrated in Fig. 3. On a perceptual
level a transformation is made from the sound, represented by a point in
complex physical space, to sensation represented by a point in a complex
perceptual space. This latter space can be divided
into loudness, pitch and timbre subspaces. The space is continuous, and any
point in this space is potentially meaningful. Any change from one point to
another in this space is detectable as long as the change is large enough to
exceed internal noise constraints. Object recognition occurs at a more central,
cognitive level where points or contours in perceptual space are through daily
experience associated with certain explicit labels such as a vowel /a/ spoken
by a child, an automobile horn, or a series of notes played on a clarinet. This
process is essentially identical for speech, music and environmental sounds.

Finally, in trying to study timbre
perception separately from object recognition it may be interesting to think about
the feasibility of a timbre-matching experiment. After all, loudness and pitch-matching
techniques are very common in psychoacoustics.

Fig. 3 - Schematic representation of auditory stimuli at the physical (frequency, spectrum, intensity), sensory (pitch, timbre, loudness) and cognitive (tune, word, car door) levels.

One can match
the loudnesses of two sounds that differ in frequency
(Fletcher & Munson 1933) or spectral content (Houtsma,
Durlach & Braida 1980),
and one can match the pitches of two sounds that differ in intensity (Stevens
1935) or spectral content (Schouten, Ritsma & Cardozo 1962). Would
it be possible to match the timbres of two sounds across differences in
intensity (loudness) and fundamental frequency (pitch)? The author is not aware
of any such experiment reported in the psychoacoustical
literature. It would, for practical reasons, not be an easy experiment to do
for a subject because of the potentially large number of physical parameters to
manipulate, with many dials to be adjusted. Nevertheless, the speech literature
(e.g., Peterson & Barney 1952) shows that formant frequencies (i.e., peaks in the spectral weight function) for vowels
spoken by a female voice are typically 15 percent higher than the same vowels
spoken by a male voice. Intonation (pitch) patterns of female voices are
typically an octave higher than those of male voices.

Our own
experience with laboratory speech synthesis confirms that if one takes a vowel
sound of a male voice, doubles the frequencies of all partials, and increases
the formant frequencies by about 15 percent, a very
good match of the vowel sound is obtained, spoken by a natural-sounding female
voice. This attests that
timbre matching is in principle possible, although the speech example of a good
phoneme match may not really imply a timbre match. Analogously, one
could expect that a bassoon player, when asked to adjust spectral parameters of
a high-frequency sound to match the timbre of a low-frequency bassoon note,
will consistently come up with settings that correspond to a high-frequency
bassoon note. This does not necessarily mean that timbres are matched.
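The described vowel transformation amounts to two independent scalings, one acting on the pitch-related source and one on the timbre-related filter. A minimal sketch (the function name is ours, and the formant values are approximate male /a/ values in the spirit of Peterson & Barney, not measurements from this work):

```python
def male_to_female(f0_hz, formants_hz):
    """Sketch of the transformation described in the text: double the
    fundamental (an octave up in pitch) and raise the formant
    frequencies -- the spectral-envelope peaks that carry vowel
    timbre -- by about 15 percent. Illustrative only, not a speech API."""
    return 2.0 * f0_hz, [1.15 * f for f in formants_hz]

# approximate male /a/: f0 near 120 Hz, formants near 730/1090/2440 Hz
f0, formants = male_to_female(120.0, [730.0, 1090.0, 2440.0])
print(f0)        # 240.0
print(formants)  # formants raised by 15 percent
```

The point of the sketch is that the two scalings are separable: changing only the first output leaves the vowel identity intact, while changing only the second alters perceived speaker timbre.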

CONCLUSIONS

The main conclusions from the material presented in this paper are:

1. Because of their
subjective nature, the parameters pitch and timbre should never be presented as
independent variables in perception studies. Doing so would amount to
describing one unknown in terms of other unknowns.

2. The roles of the attributes pitch and
timbre in the perception of speech, music and environmental sounds are very
similar.

3. In music, any study of pros and cons
of certain temperaments or tone scales should include a consideration of the
spectral composition of the sounds used to realize the music.

Linking timbre perception too exclusively with
auditory object recognition would risk repeating the history of
categorical perception in speech.

REFERENCES

American National Standards Institute (1973). American national psychoacoustical terminology, S3.20. New York: American Standards Association.

Berger, K.W. (1964). Some factors in the recognition of timbre. Journal of the Acoustical Society of America, 36, 1888-1891.