The Voice as a Musical Instrument

By A.H. BENADE

The preceding chapters of this book have concentrated on
impulsively excited tones that die away: clangs, drum thumps, guitar pluckings,
and the sounds made by the stringed keyboard instruments. It is now time to
consider sound sources that are capable of producing a sustained tone. This
chapter will be devoted to the human voice, after which we will take up the
orchestral brasses, woodwinds, and stringed instruments. (In chapter 16 we
devoted some attention to two sustained-tone instruments, the pipe organ and
its electronic counterpart, but our interest was restricted to the pitch
relationships of their sounds, and we took no account of the ways in which
these sounds are generated.)

In the present chapter we will consider how voice sounds are generated
and how these sounds are modified in the mouth and nose cavities before being
radiated into the room, after which we will look into some of the implications
of these operations for speech and for music. Our interest in the sound
production processes of the voice is twofold. On the one hand, the singing
voice has considerable musical significance; on the other hand, several of its
acoustical aspects provide us with a particularly good introduction to much that
is important in the nature of woodwinds, brasses, and bowed string
instruments.

19.1. The Voice: A Source of Controllable Sound

One has only to
listen for a moment to a singer to realize that the voice is a sound source
whose pitch is controllable. In physical terms this means that the human voice
can produce acoustic signals having repetition rates that can be varied over a
large range. The fact that a singer can enunciate different sustained sounds
(e.g., one vowel or another) while maintaining his pitch suggests further that
the other important aspect of a sustained sound (the amplitudes of its
sinusoidal components) is subject to control. It may seem curious in a book on
musical acoustics that we will be giving a fair amount of attention in this
chapter to speech sounds, particularly vowels. They prove to be useful to a
study of musical acoustics for two reasons. First, they are a musical element
of singing quite aside from their information-carrying function. Second, the
ways in which recognizable word sounds are shaped out of the original
relatively featureless vibration recipe from our vocal cords can give us
considerable insight into acoustic connections between tone color, pitch, and
the strengths of the partials we hear.

The
relationship between vowel sounds and tone color can be illustrated if we
imagine building a pair of musical keyboard instruments; one instrument uses
the sound component recipe for a particular vowel sung at C4 as a basis for
constructing its tones (by transposition), while the other similarly made instrument
uses the recipe for a different vowel sung at the same pitch. We would be
unanimous in recognizing that the two instruments have distinctly different
tone colors, even though very few of us would recognize that the sounds from
the two keyboards were copies of spoken vowels. Contrast this with what happens
when two of your friends sing or enunciate a wide variety of words at a wide
variety of pitches; their voices will retain some kind of overall tone color or
flavor through all this that allows us to recognize them as the voices of
specific people. Obviously musical sounds, including voices, have a tone color that
is connected in a nontrivial fashion to their vibration recipes, quite aside
from processing complications introduced by room acoustics.

It
is fortunate indeed for our present purposes that the human voice mechanism
separates itself very easily into unambiguously recognizable functional parts,
each of which can be thought about in isolation. Once we have examined the
various parts separately, we can put everything back together to make the
central part of what Peter Denes and Elliot Pinson of the Bell Telephone
Laboratories have called the speech chain.[1] In our investigations in this
chapter we will focus our attention almost entirely on the vibration physics of
vocal sound production; this means that we plan to ignore the mental and
neurophysiological processes governing the selection and formation of voice
sounds.

Figure 19.1 is
a block diagram of the voice mechanism as it concerns us. The labels within
most of the boxes give ordinary names to the various physiological objects with
which we are dealing, while the words written above these boxes describe the
acoustical function or nature of these objects. The box marked "sibilants,
etc.," does not quite fit into the labeling scheme just described. It
serves simply as a graphical device for reminding us that the production of
sounds like s, sh, k, t, and th
involves an auxiliary, broadband (multicomponent) random source which can be
located almost anywhere within the vocal cavity region. When speech sounds are
made, the larynx may or may not itself be vibrating to produce an oscillatory
flow of air; it is this choice that makes the distinction between the voiced
and the unvoiced consonants.

We
may quite properly think of the larynx as being what we defined in chapter 11
as a simple source. This simple source feeds into a small, very elongated
(i.e., more or less one-dimensional) room of complex shape formed by the vocal
cavities. Our study in chapter 11 of the acoustical response of rooms to
excitation by such a source should have prepared us for the idea that the sound
pressure at any given point in the vocal cavity (away from the source) will
depend drastically both on the excitation frequency and on the point of
observation. We should also recognize that (wherever we observe it) the
acoustical response will be particularly large if the excitation frequency
components of the source match one or another of the characteristic vibrational
modes of the cavity.

Over
and over in this book we have met examples of the way in which alterations in the
structure of a vibrating object, and more particularly of its boundaries, can
alter the frequencies of its characteristic modes. In the course of speaking or
singing, one continually alters the shape of one's vocal cavities. The
production of each particular vowel or consonant is associated with a fairly
well-defined shape for the cavities, and therefore with a particular pattern of
strong and weak responses to the various sinusoidal components of the airflow
controlled by the vocal cords.

As
we explore what happens inside the vocal cavity to the sound produced by the
vocal cords, we will confine our attention to what happens at the mouth
aperture. (The nose aperture, which is also used separately or with the mouth,
has very similar properties; therefore we need make no further mention of it.)
At the mouth opening, the oscillatory flow of air depends on the relation
between the excitation frequency (from the larynx) and the various resonances
of the vocal cavity. The mouth, of course, also has acoustical importance since
it serves as the source for sounds as we hear them in the room. (The specific
things going on acoustically inside the vocal cavity that we do not have time
to explore are well understood. Research is done by using a tiny probe
microphone to measure the sound pressure set up at various points inside the
vocal cavity; also, motion pictures have been made of the movements of the
vocal cords.)

In
the next two sections we will first consider the way in which the flesh folds
that are known as the vocal cords set themselves into oscillation at a
frequency corresponding to the speaker's or singer's desired pitch, and then we
will enquire into the particular ways in which the resulting oscillatory flow
from the larynx has its vibration recipe modified on its way through the vocal
cavities to the room and thence to our ears. The various patterns of these
modifications are what make different voice sounds recognizable.

The vocal
cords, which do the actual vibrating in the larynx, are flaplike folds of
muscle attached to the interior of the larynx in such a way as to produce a
slitlike opening through which air can pass. The cords are capable of assuming
a wide variety of shapes and spacings. When we breathe normally, they pull
themselves back out of the way, so as to leave an unobstructed air passage.
When we whisper, they are held close enough together that air flowing between
them generates a rushing or hissing sound made up of roughly equal amounts of
all possible frequency components ("white" noise); the vocal tract
can operate on this random collection of closely spaced sinusoidal components
to produce intelligible speech, even though the sound has a radiated sound
pressure spectrum in the room quite different from that of normal speech. When
one phonates (produces vocal sound) normally, the cords are given a shape and
spacing that permits the aerodynamic forces which arise from the air flowing
between them to set them into oscillation. However, the speed of the airflow
only slightly influences the frequency of this oscillation; the predominant
control comes from the mass of the vocal cords and the muscle tension set up in
them. The oscillation of the cords is of such a nature that they alternately
approach one another and recede, bringing about a corresponding oscillatory
decrease and increase in the amount of air that is permitted to flow between
them. Not only can the speaker choose the frequency of oscillation of the cords
(and so the pitch of the resulting sounds), he can also choose to have the
cords swing with sufficient amplitude that they can press together during a
controllable portion of each oscillatory cycle. Under these conditions, the
flow consists of momentary puffs of air whose duration can be adjusted more or
less independently of their repetition rate. As a result the singer is provided
additionally with an adjustable recipe for his internal sound source, and
therefore with one of his means for altering the tone color of his music.

As
an initial step in our quest for understanding how the air passing between
vocal cords can maintain their oscillations, we should remind ourselves of a
few facts about the motion of fluids and some of the initial consequences of
these facts. Most of us are quite familiar with these facts in an everyday way,
even if we have not thought about them formally or tried to describe them in
words. Because of their basic importance to our understanding of many things
we will examine in the rest of this book (not just in connection with the
maintenance of oscillations), I shall set down these basic ideas as the first
few members of a set of numbered statements to which we can easily make
reference whenever the need arises.

1.
Fluids (including air) tend to flow from regions of high pressure toward
regions where the pressure is low.

2.
As a consequence of the influence of pressure on fluid flow, we recognize that
if we see an increasing flow velocity of a fluid as it moves from one point to
another in its travels, we can deduce that the pressure at a high-velocity spot
must be lower than at the low-velocity point from which the fluid came. One
cannot speed anything up without arranging to have an excess of force acting
behind it.

3.
When a fluid flows steadily and continuously in a long duct, we expect the
velocity of … the duct than in the wider parts.

Statement 3 is simply a
recognition of the fact that, for fluid flowing in a leak-free duct, a fixed
volume of fluid passes any given point per second. Where the pipe cross section
is large, many small "chunks" of the slow-moving fluid travel abreast of one
another; in the narrower parts these must run quickly through the constriction
in single file.

4. A joint implication of statements 2 and 3 is that we should expect
the fluid pressure in the narrow parts of a long duct to be lower than it is in
the broad parts.

The argument leading to statement 4 runs
thus: in a leak-proof pipe any given small chunk of fluid (which you might wish
to identify by squirting in a tiny droplet of oil) finds itself accelerating to
a higher velocity as it enters a narrow region, and then slowing back down as
it continues on into a broader part of the duct. Looking at things from the
point of view of the small piece of fluid, we realize that it will not change
its state of motion unless a force acts on it. It speeds up as it enters a
constriction; therefore, the pressure behind it must be greater than in the
constricted region it is approaching. Similarly, it slows down as it leaves the
constriction; therefore, an excess pressure must be acting on its front surface
to retard it. The quantitative expression of statement 4 and an elucidation of
some of its remarkable consequences were first worked out by the Swiss
physicist Daniel Bernoulli in 1738. The formal expression of our statement 4 is
known as Bernoulli's Theorem for Steady Flow.

5. The presence of viscous friction that is normally found in a fluid
and between the fluid and its containing walls does not change the qualitative
correctness of statements 1 through 4. However, it leads to a reduction in
the total amount of fluid that passes through the system per second under the
influence of a given driving pressure of the source.
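
Statements 3 and 4 can be tried out numerically. The sketch below is only an illustration under stated assumptions: it treats the flow as steady and frictionless (so statement 5 is ignored), it uses the standard steady-flow Bernoulli relation in which pressure plus one-half the density times the square of the velocity stays constant along the duct, and the flow rate and duct areas are invented round numbers rather than measurements of any real airway. The function names are ours.

```python
# Sketch of statements 3 and 4 for steady, frictionless flow in a duct.
# All numerical values are illustrative assumptions, not data from the text.

AIR_DENSITY = 1.2  # kg/m^3, approximate density of room air (assumed)


def flow_speed(volume_flow, area):
    """Statement 3: a fixed volume passes every point per second,
    so the speed is the volume flow divided by the cross-sectional area."""
    return volume_flow / area


def pressure_drop(volume_flow, wide_area, narrow_area, density=AIR_DENSITY):
    """Statement 4 (steady-flow Bernoulli relation): pressure in the wide
    part minus pressure in the narrow part, in pascals."""
    v_wide = flow_speed(volume_flow, wide_area)
    v_narrow = flow_speed(volume_flow, narrow_area)
    return 0.5 * density * (v_narrow ** 2 - v_wide ** 2)


if __name__ == "__main__":
    q = 2.0e-4       # m^3/s, assumed volume flow
    wide = 2.0e-4    # m^2, assumed area of the wide duct
    narrow = 2.0e-5  # m^2, assumed area of the constriction
    print("speed in wide part  :", flow_speed(q, wide), "m/s")
    print("speed in narrow part:", flow_speed(q, narrow), "m/s")
    print("pressure drop       :", pressure_drop(q, wide, narrow), "Pa")
```

With these assumed numbers the air moves ten times faster in the constriction, and the pressure there is lower than in the wide part, as statement 4 requires.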

We
are now provided with the information needed for a look at the vocal cords in
their role as oscillators. If a mechanical engineer were asked to design a
simplified machine that worked in much the same way as the vocal cords, he
might very well come up with something of the sort shown in figure 19.2. Air
from the lungs flows in the diagram from left to right through a large-diameter
duct (A) which corresponds to the
windpipe or trachea. The air then flows through a constriction (B) and out
again into an enlarged portion of the duct (C), which is the beginning of the
vocal tract. The upper boundary of the constriction consists chiefly of a mass
M mounted on a spring having a stiffness coefficient S, the mass being free to
oscillate smoothly up and down along a carefully fitted guide. This guide is
made leakproof by means of some grease, which also serves to lubricate the
guide. Our engineer has chosen to represent one of the two vocal cords by this
spring-mass system (with viscous damping D provided by the sealing grease). The
other cord would move symmetrically with the first under the influence of
similar forces, and so can be left out of our initial consideration.

If
no air is sent through our iron larynx, it is easy for us to see that the
natural frequency of oscillation of the mass M is proportional to the quantity
√(S/M), and that
if it is pulled aside and released, the oscillations will die away with a
halving time proportional to M/D (see sec. 6.1). It is this natural frequency
which the singer changes as he shifts from one musical pitch to another.
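
The two proportionalities just stated can be made concrete with a short calculation. The sketch below assumes the usual lightly damped spring-mass model, in which the damping force is D times the velocity; under that assumption the frequency is √(S/M) divided by 2π and the amplitude falls to half in a time 2(ln 2)M/D. These constants of proportionality, the function names, and the sample values of M, S, and D are our assumptions, chosen only to give a frequency near 100 Hz.

```python
import math


def natural_frequency(stiffness, mass):
    """Natural frequency in Hz of a lightly damped spring-mass system."""
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)


def halving_time(mass, damping):
    """Time in seconds for the amplitude to fall to half, assuming a damping
    force equal to damping * velocity (amplitude ~ exp(-damping * t / (2 * mass)))."""
    return 2.0 * math.log(2.0) * mass / damping


if __name__ == "__main__":
    M = 1.0e-4  # kg, assumed mass of the model cord
    S = 40.0    # N/m, assumed spring stiffness
    D = 0.01    # N*s/m, assumed damping coefficient
    print("natural frequency:", round(natural_frequency(S, M), 1), "Hz")
    print("halving time     :", round(halving_time(M, D), 4), "s")
    # Quadrupling the stiffness doubles the frequency (one octave up).
    print("with 4 x S       :", round(natural_frequency(4.0 * S, M), 1), "Hz")
```

The last line shows why tension is such an effective pitch control in this model: a fourfold increase in stiffness raises the pitch by an octave.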

If
the airstream is turned on, we recognize on the basis of statement 4 above that
the air pressure at (B) will be reduced relative to what it is both at (A) and
at (C). If the mass moves downward, further constricting the opening, two
opposing things will happen. Narrowing the aperture will increase the speed of
the air motion at (B), as a result of which the pressure here is also reduced,
thus tending to suck the mass even farther down. On the other hand, the added
frictional resistance produced in the narrowed opening will (if the lung
pressure is kept the same) reduce the total volume of air that flows past per
second. As a result, the flow-dependent pressure will not change in quite the
way we would otherwise expect. When everything so far is taken into account, we
find that the presence of flowing air causes the mass to feel an aerodynamic force
that has two recognizable components: a steady inward force, plus one which
fluctuates as the mass vibrates in and out. We shall call this last,
fluctuating part the oscillatory Bernoulli force.

Let
us see how the presence of flow can be expected to modify the sinusoidal
oscillation which would normally result from the interaction of the spring with
the mass. The steady part of the flow-induced force pulls M in against the
elasticity of the spring to a new equilibrium position in which the aperture is
slightly reduced. We find further that as the mass oscillates, the other
flow-induced force component acts along with the spring as an additional
restoring force tending to pull the mass back toward its altered equilibrium
position. It is thus perfectly permissible for us at this stage in our thinking
to consider the joint action of the spring and the airflow as being equivalent
to the action of a single spring having
a somewhat larger stiffness coefficient.
The conclusion follows then that the natural frequency of oscillation of our
imitation vocal cord is slightly raised by the existence of an airflow past it.
Notice, however, that we have not yet found anything that can counteract the
damping effect of the lubricating grease. In other words, we have not yet
discovered any means whereby the flowing current of air can initiate or
maintain oscillations of the vocal cord.

Let
us digress a moment now and examine the motion of a child on a swing, and
notice what we must do while pushing him. This examination will suggest to us
what to look for in the larynx, which is a device whose cords are of course
known to oscillate. As a child swings back and forth, we recognize first the
springlike restoring force that arises from the joint effect of his weight and
of the oblique rope which supports him. As we learned in chapter 6, this force
acts in a direction opposite to the child's displacement; it determines the
frequency of oscillation according to a familiar formula. Once the child is
pulled to one side and released, he swings in ever-decreasing arcs; the
decrease is the result of the viscous friction of the air through which he
moves (see fig. 10.5). Notice that the viscous friction is a damping force that
acts in a direction opposite to the motion
of the child. The contrast between the restoring force and the damping
force can be made clear if we realize that the restoring force is zero at
midswing, where the damping force on the rapidly moving child reaches a
maximum. Conversely, the damping force falls to zero as the child comes to rest
at the limits of his travel, which are the points at which the restoring force
has its largest value.

If we wish to maintain the swinging motion of the child, it seems pretty obvious that it
is necessary to do our pushing in the direction of the child's motion. More
accurately, we realize that if we push on him over an appreciable fraction of
the time of one cycle, at least the predominant
share of our pushing should take place in the helpful direction. Let us
distill these ideas into the sixth of our numbered statements:

6.
Because the damping force on a vibrating object always acts to oppose the
motion of the object, any successful attempt to maintain the oscillation
requires the application of a periodic force that acts (at least predominantly)
in the same direction as the motion.

Let
us now go back to our artificial larynx to seek the missing force contribution
that meets the requirements laid down in statement 6. Our model at this point
is too simple in that it takes insufficient account of the fact that the
airflow is by no means steady: it increases and decreases as the valve opens
and closes. In the case of unsteady flow, Bernoulli's theorem does not quite
hold true. Because of the inertia of the moving air, the velocity of air
flowing through a constriction cannot instantaneously readjust itself as the
aperture is changed. In other words, the sinusoidally varying aperture
determined by our oscillating mass has passing through it an airflow whose
variations lag behind by a small amount.

Figure
19.3 will allow us to see how the oscillation is maintained. At the top of the
diagram we see a curve that represents the sinusoidal up-and-down oscillations
of the mass M. The bottom part of the figure shows the corresponding variation
of the flow-induced oscillatory Bernoulli force that acts upon it. Notice that
the force reaches its upward and downward maxima at instants of time that are
slightly later than those at which the maximum excursions of the mass itself
take place. To help us recognize the relationship between the Bernoulli force
and the direction of motion of M, all parts of the displacement curve that
correspond to downward motion are so labeled, and they are also drawn using a
beaded line. In similar fashion, those parts of the force curve that represent
a downward urging on the mass are labeled and drawn with a beaded line. The
parts of the two curves corresponding respectively to upward motion and upward
force are also labeled, and are drawn using plain lines. In the middle area of
figure 19.3 we find a series of
shaded boxes which call our attention to those periods of time during which the
Bernoulli force acts in the same
direction as the motion of the vocal-cord surrogate M. These are the times
during which the force contributes to the maintenance of oscillation. Notice
that these intervals of "helpful" interaction are longer than the
intervening periods during which the force tends to diminish the oscillation.
The net action is therefore of the sort needed for the maintenance of
oscillation, according to the requirements of statement 6.
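
The bookkeeping of helpful and unhelpful intervals can be checked with a small numerical experiment. The sketch below rests on assumptions of our own: the displacement of M is taken as a pure sinusoid, the oscillatory Bernoulli force is modeled as a restoring force of unit strength that is merely delayed by a chosen fraction of a period, and the units are arbitrary. It adds up the work done on the mass over one cycle and measures the fraction of the cycle during which force and velocity point the same way.

```python
import math


def cycle_summary(lag_fraction, steps=100000):
    """Return (net work per cycle, fraction of the cycle in which the force
    acts along the motion) for a restoring force delayed by lag_fraction
    of a period. Amplitudes are set to 1, so the work is in arbitrary units."""
    phi = 2.0 * math.pi * lag_fraction
    d_theta = 2.0 * math.pi / steps
    work = 0.0
    helpful_steps = 0
    for i in range(steps):
        theta = (i + 0.5) * d_theta       # phase of the displacement sin(theta)
        velocity = math.cos(theta)        # rate of change of the displacement
        force = -math.sin(theta - phi)    # delayed restoring force (assumed model)
        work += force * velocity * d_theta
        if force * velocity > 0.0:
            helpful_steps += 1
    return work, helpful_steps / steps


if __name__ == "__main__":
    for lag in (0.0, 0.05, 0.10):
        work, helpful = cycle_summary(lag)
        print("lag", lag, "-> net work", round(work, 4),
              ", helpful fraction", round(helpful, 3))
```

With no lag the helpful and unhelpful intervals are equal and the net work is zero, so nothing offsets the damping. With a small lag the helpful intervals exceed half the cycle and the net work becomes positive, which is the condition laid down in statement 6.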

Detailed study of our mechanical model of the larynx shows that it has
all of the major properties of the real larynx, but lacks some of the subtler
features.

James Flanagan
and his coworkers at the Bell Telephone Laboratories have found, however, that
almost everything can be well accounted for with only a slight elaboration of
our simple machine.[2] All that is required is the provision of two adjacent
movable lumps of matter, each with its own spring and damper, plus a coupling
spring between them. This makes the whole larynx model into a cousin of the
two-mass chain, with consequences some of which you will be able to guess with
the help of what is said in sections 6.3 and 10.5.

We
will close this section with a brief look at the actual flow patterns (and
their sinusoidal components) that come through the larynx to act as a sound
source for the rest of the vocal system.[3] The patterns range between the two
limiting forms shown in figure 19.4. The top part of the figure shows the
successive puffs of air produced when a man sings a note a little above G2 (100
Hz) with a relatively high breath pressure and fairly close initial spacing of
the vocal cords. Notice first of all that the successive puffs of air are quite
uniformly spaced (0.01 seconds apart), giving a well-defined repetition rate.
This tells us that the partials are harmonically related. During each puff, the
flow rises fairly quickly to a somewhat spiky peak, and then decreases in a
slightly wiggly fashion. Notice further that the flow ceases completely for
about one-third of each cycle, during the interval when the two cords have
pressed themselves together.

The
lower part of figure 19.4 shows the other extreme in voice production. A gentle
stream of air is sent past the cords, flowing just strongly enough to keep them
vibrating. The cords do not close completely, however, so that the flow is
never shut off altogether. The waveform here is not as spiky as before, being
shaped more like a slightly skewed sinusoid. We will postpone until later in
the chapter any consideration of the implications of the slight irregularities
existing between successive pulsations.

The
vibration recipes for the two flow patterns illustrated in figure 19.4 differ
chiefly in the relative amplitudes of the first half dozen pairs of
corresponding partials. In the spiky waveform, partials from 1 to about 6 are
of roughly equal amplitude, whereas above this the amplitude of the nth
component is about 1/n² as large as that of the first partial. In
the more rounded signal, the 100-Hz fundamental component is considerably
stronger than the other harmonic components, say 4 or 5 times the amplitude of
partial 2, after which the amplitudes fall away with extreme rapidity.

For
ordinary speech we may safely assume a pattern of flow intermediate between the
two we have just considered. This intermediate pattern has a slightly skewed triangular
shape. The flow is reduced to zero only momentarily, and the pattern shows a
slightly rounded peak at the top. This shape is almost precisely what one sees
at the start-up of a guitar string that is plucked somewhat to one side of
center. This means that if we want the recipe for a typical intermediate voice
sound, we can take over exactly the same recipe described in section 7.2, as
modified by the corner-rounding explained in section 8.4. That is, the
amplitude A_n of the nth harmonic partial is primarily related to the
fundamental amplitude A_1 by the formula A_n = A_1/n², with a few partials being
weakened because their nodes (in time now instead of in space along the string)
lie near the top corner of the waveform (the analog of the plucking point).
A communications engineer would
describe a recipe like this as having a few "zeros" in it, with the
shape being outlined by an "envelope" that falls at the rate of 12 dB
per octave.
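
The arithmetic behind the 12 dB per octave figure is easy to verify. The sketch below computes the envelope A_n = A_1/n² and expresses it in decibels. It also includes, as an assumption of ours, the standard result for an idealized, perfectly sharp triangular waveform whose corner lies at a fraction p of the period: the nth amplitude is then proportional to |sin(nπp)|/n², which is where the "zeros" come from. The function names and the choice p = 0.25 are illustrative only.

```python
import math


def envelope_amplitude(n, fundamental_amplitude=1.0):
    """Envelope of the recipe: A_n = A_1 / n^2."""
    return fundamental_amplitude / n ** 2


def level_db(amplitude, reference=1.0):
    """Amplitude expressed in decibels relative to a reference amplitude."""
    return 20.0 * math.log10(amplitude / reference)


def triangular_recipe(n, corner_fraction):
    """Relative amplitude of partial n for an ideal triangular waveform whose
    corner sits at corner_fraction of the period (assumed idealization)."""
    return abs(math.sin(n * math.pi * corner_fraction)) / n ** 2


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print("partial", n, "envelope level", round(level_db(envelope_amplitude(n)), 2), "dB")
    # With the corner one quarter of the way through the cycle (assumed),
    # every fourth partial is essentially missing.
    for n in range(1, 9):
        print("partial", n, "relative amplitude", round(triangular_recipe(n, 0.25), 4))
```

Each doubling of the partial number, which is to say each octave, lowers the envelope by very nearly 12 dB.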

19.3. Sound Transmission
through the Vocal Cavities and into the Room

The vocal
tract, which extends from the larynx to the mouth (and/or nose) aperture, has
the duty of transforming the rather simple airflow spectrum provided by the
vocal cords into the recognizable acoustical patterns needed for speech and
music. We have already learned in broad outline that the larynx, acting as a
source, feeds one point in an elongated, roughly tubular, one-dimensional
"room" whose set of natural frequencies can be adjusted (by movements
of the tongue, lips, etc.). The mouth aperture is a sort of window at the far
end of this room, acting in its turn as a simple source for the excitation of
the vibrational modes of the three-dimensional room in which we can imagine we
are listening.

The
pressure variations produced by the larynx in the vocal tract, and thence the
strength of the resulting source at the mouth, depend in a simple way on the
adjustable resonance properties of the vocal tract. The pressure amplitudes
produced for the various voice partials in the room surrounding the listener do
not, however, have a simple proportionality to the strengths of the
corresponding airflow components from the mouth. Simple sources radiating into
a three-dimensional room have the fundamental property (mentioned earlier in
connection with the discussion of figure 11.3) that the room-averaged sound
pressure resulting from a given source strength is larger for high-frequency
sources than for those oscillating more slowly. More precisely, for every
doubling of frequency, there is a doubling of sound pressure in the room,
provided the source strength is kept constant. A telephone engineer would say
that the sound pressure in a room due to a constant-strength source rises at
the rate of 6 dB/octave. The physical explanation of this relative emphasis at
high frequencies is to be found in the rapidly increasing number of
off-resonance room modes whose collected responses make up so much of the sound
in a room (see sec. 11.4). There is no corresponding increase in the number of
modes at high frequencies in a one-dimensional (i.e., long and narrow) room,
which explains why we do not find a similar "treble boost" taking
place at the junction of larynx and vocal tract.
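
The 6 dB per octave figure, and its effect on a source recipe that falls as 1/n², can be worked out in a few lines. The sketch below is a rough illustration under assumptions of our own: room-averaged pressure is taken to be exactly proportional to frequency for a constant-strength source, the flow partials are taken to follow A_n = A_1/n², and the resonances of the vocal tract and the sensitivity of the ear are left out entirely. The function names and the 100-Hz fundamental are illustrative.

```python
import math


def room_gain_db(frequency_hz, reference_hz):
    """Rise in room-averaged pressure level for a constant-strength simple
    source, assuming pressure proportional to frequency."""
    return 20.0 * math.log10(frequency_hz / reference_hz)


def source_level_db(partial_number):
    """Level of the nth flow partial relative to the first, assuming
    amplitudes that fall as 1/n^2."""
    return -40.0 * math.log10(partial_number)


if __name__ == "__main__":
    f1 = 100.0  # Hz, assumed fundamental
    for n in (1, 2, 4, 8):
        gain = room_gain_db(n * f1, f1)
        net = source_level_db(n) + gain
        print("partial", n,
              "source", round(source_level_db(n), 1), "dB,",
              "room gain", round(gain, 1), "dB,",
              "net", round(net, 1), "dB")
```

Under these simplifications the treble boost of the room cancels half of the source's 12 dB per octave fall, leaving a net decline of about 6 dB per octave before the vocal tract resonances are taken into account.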

In
addition to the systematic effect of the mouth's radiation behavior on the
sound pressure recipe, we need to take into account the fact that our ears
themselves have progressively greater sensitivity for high frequencies (up to
about 3500 Hz) than they have for lower frequencies. In what follows, both
effects will be taken into account, and the discussion will be confined to the
loudnesses, expressed in sones (see secs. 13.4 and 13.6), of the individual
voice partials that someone would perceive if they came to his ear one by one,
on the assumption that he is listening only a short distance away from the
mouth of the singer or the person speaking. We will give the name loudness recipe or loudness spectrum to the description of the strengths of the
various partials calculated in this way for a given vocal tone.

The
top part of figure 19.5 shows the loudness recipe that is typical of the vowel
[ah] steadily pronounced as in the word father
by a man who pitches his voice 35 cents above G2.[4] The sinusoidal
components of his voice sounding at this pitch will be exact multiples of 100
Hz. If the fundamental component of this sound reaches the listener's ear to
produce a loudness of a trifle over two sones (as shown), the second partial
would be heard at about 4.2 sones, etc. Notice that partial number 7 is very
loud. We notice further that the loudness of the 11th partial is also greater
than that of its adjacent neighbors. In similar fashion the 26th harmonic is
also emphasized in the overall loudness spectrum of our 100-Hz tone.
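
The statement that a pitch 35 cents above G2 gives components at exact multiples of 100 Hz can be checked directly. The sketch below assumes equal temperament with A4 tuned to 440 Hz, which the text does not state explicitly; G2 then lies 26 semitones below A4. The function names are ours.

```python
A4_HZ = 440.0  # assumed tuning reference


def equal_tempered_frequency(semitones_from_a4, a4=A4_HZ):
    """Frequency of the note lying the given number of semitones from A4."""
    return a4 * 2.0 ** (semitones_from_a4 / 12.0)


def shift_by_cents(frequency, cents):
    """Raise (or lower) a frequency by a number of cents; 1200 cents = 1 octave."""
    return frequency * 2.0 ** (cents / 1200.0)


if __name__ == "__main__":
    g2 = equal_tempered_frequency(-26)   # G2 is 26 semitones below A4
    sung = shift_by_cents(g2, 35.0)
    print("G2             :", round(g2, 2), "Hz")
    print("35 cents higher:", round(sung, 2), "Hz")
    for n in (7, 11, 26):
        print("partial", n, "lies near", round(n * sung), "Hz")
```

On this assumption the sung fundamental comes out within a small fraction of a hertz of 100 Hz, so the emphasized partials 7, 11, and 26 lie near 700, 1100, and 2600 Hz.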

The lower half of figure 19.5 shows the
loudness spectrum associated with a 220-Hz (A3) tone produced by the same man
if he keeps his jaw, tongue, and lip positions unchanged from those used for
the 100-Hz tone. The pitch of this tone is somewhat more than an octave higher
than the first, but we would still agree that the same [ah] vowel is being
produced. Notice that the overall shapes of the two loudness spectra are
similar: in both cases we find a particularly
strong component in the region from 600 to 700 Hz, another near 1100 Hz, and a
third one lying near 2600 Hz that is louder than its neighbors. In between
these loud components we find weaker ones, and the strengths of these in the
two tones are quite similar as long as we
confine our attention to some particular frequency region. For example, the 20th
partial of the 100-Hz tone and the 9th one of A-220 both lie close to 2000 Hz
and have loudnesses of about 2 sones.
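
The correspondence between the partials of the two tones can be tabulated with a few lines of code. The sketch below simply finds, for each frequency region mentioned above, the harmonic of a 100-Hz tone and of a 220-Hz tone that lies closest to it. The function name and the list of target frequencies are ours, taken from the round numbers quoted in the text; no loudness values are computed.

```python
def nearest_partial(fundamental_hz, target_hz):
    """Return (partial number, partial frequency) of the harmonic of the
    given fundamental that lies closest to target_hz."""
    n = max(1, round(target_hz / fundamental_hz))
    return n, n * fundamental_hz


if __name__ == "__main__":
    regions_hz = (700.0, 1100.0, 2000.0, 2600.0)  # regions quoted in the text
    for fundamental in (100.0, 220.0):
        for target in regions_hz:
            n, freq = nearest_partial(fundamental, target)
            print("fundamental", fundamental, "Hz: partial", n,
                  "at", freq, "Hz is nearest to", target, "Hz")
```

For the 100-Hz tone the nearest partials are numbers 7, 11, 20, and 26. For the 220-Hz tone they are numbers 3, 5, 9, and 12, at 660, 1100, 1980, and 2640 Hz, which is why the two tones show strong and weak components in the same frequency regions even though the partial numbers differ.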

The common element of the two differently pitched [ah] sounds that we
have examined is the presence of especially strong components near 700, 1100,
and 2600 Hz, and the existence of frequency regions near 900 and 2000 Hz and
below about 300 Hz in which the partials are especially weak. The explanation
of these peaks and dips in the loudness spectrum is easy to find: the peaks
correspond to the characteristic frequencies of the particular vocal tract air
column used by our subject when he is asked to pronounce the vowel [ah], and
the dips arise from the tendency for cancellation between the in-phase
responses of a higher mode driven below resonance and of a lower mode driven
above its natural frequency. These matters were carefully discussed in section
10.5.

What is often called the spectrum envelope of the [ah] sound is a smooth
curve drawn to indicate the pattern of loudness of this vowel, regardless of
what fundamental voice frequency is used for its production. This spectrum
envelope is almost exactly the ordinary resonance response curve measured
between the point of original excitation and the position of the detector.
Figure 11.3 is an example of such a
curve measured between two points in a room, while figure 10.14 shows the
corresponding transmission for vibration between points on a metal tray. In
this chapter we are using a slightly modified version of these transmission
curves, since we want to make allowance for the properties of the ear itself.

The middle part of figure 19.6 is the loudness spectrum envelope for
[ah]; the top and bottom parts of the figure show the corresponding envelopes
for the vowels [oo], the middle sound of the word pool, and [ee], whose sound
is found in the word feet. Each recognizable vocal sound that
we produce is associated with its own particular arrangement of characteristic
mode frequencies for the vocal tract, and each of these is brought about by a
particular shaping of the air column.

We are now in a position to summarize and slightly extend the basic
ideas of vocal sound production as we have met them so far. This summary is an
abbreviated paraphrase of the opening remarks in the present-day classic study,
Acoustic Theory of Speech Production, by
the Swedish scientist Gunnar Fant, who is director of the Speech Transmission
Laboratory at the Royal Institute of Technology in Stockholm. [5]

1. The vocal cords oscillate at a frequency determined primarily by their mass and
tension, with frictional losses being restored by means of aerodynamic
(Bernoulli) forces produced by the stream of air from the lungs.

2. This oscillation of the vocal cords transmits roughly triangular puffs of air
into the vocal tract. The repetition rate of these puffs is equal to the
vibration rate of the cords. The vibration of the cords, and therefore the
shape of the resulting puffs, varies slightly from cycle to cycle, even when an
attempt is made to generate a perfectly steady sound.

3. A voice source (as heard in the room) is characterized by a spectrum envelope.
Each vowel (and consonant) sound that one may wish to produce has its own
characteristic spectrum envelope. The peaks and dips of any such spectrum
envelope are determined by the frequencies of the characteristic vibrational
modes of the corresponding vocal tract configuration.

4. The peaks that are observed in the spectrum envelope are called formants. Conventionally one assigns an
identifying serial number to these formant peaks, formant 1 being the one
having the lowest frequency.

5. For males the first formant peak of any vocal sound lies in the frequency
region between 150 and 850 Hz, the second in the range between 500 and 2500 Hz,
and the third and fourth in the 1500-to-3500-Hz and 2500-to-4800-Hz regions.

6. As a consequence of the one-dimensional, long and narrow nature of the vocal
tract, the average spacing of the formant frequencies is roughly constant. Its length
is such that for males the average spacing is about 1000 Hz. Because of these
limitations, it is not possible for a person to achieve every arbitrarily
chosen pattern of formants within the ranges given above.

7. Two people uttering the "same" sound will generally use slightly
different formant frequencies, partly because of differences in their regional
accent, and partly because of differences in the dimensions of their vocal
tracts. Women's formants generally lie about 17 percent higher, and children's
about 25 percent higher, than those typical of men.

8. The first three formants dominate the recognizability of speech, and much
intelligibility is retained if only two formants are present.
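
Points 5 through 7 can be made concrete with a small calculation. The sketch below rests on a simplification of our own, not on anything stated in Fant's summary: it treats the vocal tract as a uniform tube closed at the vocal cords and open at the lips, with an assumed length of 17.5 cm and an assumed sound speed of 350 m/s. Such a tube has modes at odd multiples of c/4L, so its mode spacing is c/2L, which for these illustrative numbers comes out at the 1000 Hz quoted in point 6. The 17 and 25 percent factors are those given in point 7.

```python
# Illustrative model only: a uniform tube closed at one end (the vocal cords)
# and open at the other (the lips). TRACT_LENGTH_M and SOUND_SPEED_M_S are
# assumed values, not figures given in the text.
SOUND_SPEED_M_S = 350.0
TRACT_LENGTH_M = 0.175


def tube_mode_frequencies(length_m, count, speed=SOUND_SPEED_M_S):
    """Return the first `count` mode frequencies (Hz) of a closed-open tube."""
    return [(2 * n - 1) * speed / (4.0 * length_m) for n in range(1, count + 1)]


def scale_formants(formants_hz, percent_higher):
    """Raise every formant by the same percentage (point 7 of the summary)."""
    return [f * (1.0 + percent_higher / 100.0) for f in formants_hz]


male_modes = tube_mode_frequencies(TRACT_LENGTH_M, 4)
spacing = male_modes[1] - male_modes[0]

print("modes (Hz):", [round(f) for f in male_modes])
print("spacing (Hz):", round(spacing))
print("women, +17%:", [round(f) for f in scale_formants(male_modes, 17)])
print("children, +25%:", [round(f) for f in scale_formants(male_modes, 25)])
```

A real vocal tract is never a uniform tube, which is why each formant can wander over the wide ranges listed in point 5 while the average spacing stays near the value computed here.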

The importance of the formant peaks, and in particular of the frequencies of these
peaks, suggests that a sound made up of a few inharmonically related sinusoids
each of which is matched to one of the formant frequencies of a particular
vowel might be heard as giving that particular vowel. For example, we might
guess that the [ah] sound could arise from the simultaneous sounding of
components at 700, 1100, and 2600 Hz, or that [oo] would be produced by
components at 300, 625, and 2500 Hz. This does not in general prove to be the
case.

We
consider next the much more serious problem of the possibility of ambiguity in
the recognition of a given formant pattern, and learn of the way in which our
ears exploit the information available to them to resolve the ambiguity.
Suppose for example that our experimental subject is asked to produce exactly
the same [ah] sound that led to the spectra shown in figure 19.5, except that
he is to use a frequency of 440 Hz as the fundamental frequency rather than the
100- and 220-Hz values he used before. For a man to sound a 440-Hz tone
generally requires a shift to what is called the falsetto, a type of sound production that is understandable in
terms of a double-mass vocal cord model in which the motion is a combination of
mode-1 and mode-2 oscillations. The relationship between walking and running
is an analogous piece of physics in which we recognize differing combinations
of two characteristic modes of oscillation. The loudness spectrum for the
higher-pitched 440-Hz sound is readily deduced from the one appropriate to the
220-Hz tone an octave lower: one has only to obliterate the odd-numbered
components from the lower diagram in figure 19.5. Elimination of the odd
components appears (at least on paper) to do a rather destructive thing to the
recognizability of the formant pattern, since the strong components at 660 and
1100 Hz are eliminated, along with the noticeably weak one close to 2000 Hz.
The remaining partials (harmonics of 440 Hz) are indicated in the diagram by
crosses drawn above each one of them, so that your eye can more easily
visualize a rather broad implied formant hump extending from around 200 Hz to
nearly 1500 Hz, together with a spike at 2640 Hz belonging to the strong 6th
harmonic of the 440-Hz tone. Comparison of this implied spectrum envelope with
the envelope given for [oo] at the top of figure 19.6 shows that the two have a
very similar appearance. This means that these two vowels would be hard to
distinguish when spoken at a pitch corresponding to 440 Hz. There would of
course be no difficulty in distinguishing the 440-Hz version of [ee] from the
other two sounds.
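
The bookkeeping behind the octave shift is easy to check. The following sketch (the function name and the 2700-Hz upper limit are our own choices for illustration) lists the partials of the 220-Hz tone, removes the odd-numbered ones, and confirms that what survives is the harmonic series of the 440-Hz tone.

```python
def harmonics(fundamental_hz, upper_limit_hz):
    """List (number, frequency) pairs for harmonics up to upper_limit_hz."""
    result = []
    n = 1
    while n * fundamental_hz <= upper_limit_hz:
        result.append((n, n * fundamental_hz))
        n += 1
    return result


lower_tone = harmonics(220, 2700)

# Dropping the odd-numbered partials of the 220-Hz tone ...
removed = [freq for number, freq in lower_tone if number % 2 == 1]
survivors = [freq for number, freq in lower_tone if number % 2 == 0]

# ... leaves exactly the harmonics of the tone an octave higher.
upper_tone = [freq for _, freq in harmonics(440, 2700)]

print("removed:", removed)
print("survivors:", survivors)
print("440-Hz harmonics:", upper_tone)
```

The removed list contains the components at 660, 1100, and 1980 Hz mentioned above, while the 2640-Hz component survives as the 6th harmonic of the higher tone.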

The
resolution of the ambiguity proves to be straightforward. The fact that the
repetitive motion of the vocal cords is not precisely regular (due in part to
inescapable muscle tremor and in part to certain aerodynamic instabilities of
flow) means among other things that there is a continual fluctuation of the
fundamental frequency-a sort of random vibrato. A typical extent for this
fluctuation is 0.5 percent, corresponding to variations of 2.2 Hz, 4.4 Hz, and
6.6 Hz at the first three harmonics of 440 Hz. Since the component near 440 Hz
is fluctuating a little in frequency, the strength of this partial also
fluctuates as the excitation slides up and down on the resonance curve of the
vocal tract. For instance, an upward fluctuation of frequency brings this
component closer to the first formant resonance, and so increases the loudness
of what we hear. At 440 Hz, then, our ear is supplied with the information that
the spectrum envelope curve is steeply rising toward high frequencies (verify
this by looking at the slope of the curve for [ah] at 440 Hz in fig. 19.6).
This tells our ears that a formant peak lies a little above 440 Hz. In an
exactly similar fashion, fluctuations of the 880-Hz second partial inform us
that in this neighborhood the spectrum envelope is roughly horizontal (i.e.,
this component lies at either the top of a formant peak or at the bottom of a
dip in the spectrum envelope). To continue, the downward slope of the response
curve brought to light by fluctuations of the third harmonic (around 1320 Hz)
implies the existence of a formant peak lying below this frequency. Let us put
these various pieces of information together now to see how completely the
ambiguity has resolved itself. The behavior of partial 1 tells us there is a
peak on the high-frequency side of it. This missing peak must lie between
partials 1 and 2 since partial 2 could not possibly be at the top of a peak and
still match partial 1 in loudness. A similar argument establishes the presence
of formant 2 between partials 2 and 3.
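
The arithmetic of this argument, and the inference our ears draw from it, can be sketched in a few lines. The fluctuation extents follow directly from the 0.5 percent figure. The loudness readings fed to the second function are hypothetical numbers invented for illustration, and the 5 percent tolerance is an arbitrary choice; neither comes from measurements.

```python
FUNDAMENTAL_HZ = 440.0
FLUCTUATION_PERCENT = 0.5  # typical extent quoted in the text


def fluctuation_extent(harmonic_number, fundamental_hz=FUNDAMENTAL_HZ,
                       percent=FLUCTUATION_PERCENT):
    """Frequency swing (Hz) of one harmonic when the pitch wobbles by `percent`."""
    return harmonic_number * fundamental_hz * percent / 100.0


def envelope_hint(loudness_when_sharp, loudness_when_flat, tolerance=0.05):
    """Guess the local slope of the spectrum envelope from two loudness readings.

    `tolerance` (a fractional change) is an arbitrary illustrative threshold.
    """
    mean = (loudness_when_sharp + loudness_when_flat) / 2.0
    change = (loudness_when_sharp - loudness_when_flat) / mean
    if change > tolerance:
        return "rising: a formant peak lies above this partial"
    if change < -tolerance:
        return "falling: a formant peak lies below this partial"
    return "level: this partial sits on a peak or in a dip"


for n in (1, 2, 3):
    print(n, round(fluctuation_extent(n), 1), "Hz")

# Hypothetical loudness readings (sones) for partials 1, 2, and 3,
# given as (reading on an upward fluctuation, reading on a downward one).
readings = {1: (6.3, 5.7), 2: (5.0, 5.0), 3: (4.6, 5.4)}
for n, (sharp, flat) in readings.items():
    print(n, envelope_hint(sharp, flat))
```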

There
is an even more clear-cut way in which our hearing process manages to keep
track of the formant locations that might otherwise sandwich themselves between
the voice harmonics. In speaking and singing, one is constantly going from one
sound to another, and each formant moves smoothly from its position for one
part of the utterance to that belonging to the next part. If the pitch is
maintained constant throughout, we have the spectrum envelope moving past the
fixed voice harmonics to plot out their shapes in time, just as we earlier
found that pitch fluctuations are able to explore the shape of a fixed formant
pattern. In actual speech and singing, of course, both processes are going on
continually as we raise and lower the pitch of our voices and simultaneously
change the formant patterns belonging to the separate parts of the words we are
enunciating.

19.4. The Male Voice and the "Singer's Formant"

The bass-baritone voice can be thought of as a musical instrument whose lowest note
has a fundamental frequency lying in the region of 80 Hz (near E2), with its top note (near F4) having a
fundamental in the neighborhood of 350 Hz. In this section we will seek some of
the musically relevant elements that characterize the tones of this vocal
instrument (which elements are typical also of the higher male voices), and
learn how the singer can make alterations in his mode of tone production. We
will ignore the verbal communication aspects of singing, considering only those
musical effects that might be noticed by a listener who is not acquainted with
the language being sung.

The
relatively stable and featureless source spectrum generated in a singer's
larynx is operated on by his vocal tract to produce the elaborately shaped and
rapidly varying audible spectrum that comes to our ears as the singer goes from
note to note and from vowel to vowel (see secs. 19.2 and 19.3). While we are
listening to a singer, our nervous system (in the midst of its many other
duties) deduces a kind of running average and seeks correlations over
successive brief but overlapping spans of time; this continual processing gives
us a good perceptual idea of the common element in the singer's varied sounds,
this common element being the source spectrum generated by his larynx. When the
puffs of air are short and spiky, we say that the singer is using a light or
bright voice. The darker voice colors are associated with a rounded,
smoothed-out pattern of airflow (see fig. 19.4).

Digression on the Extraction of Average Properties: The LTAS.

The following laboratory technique is based on a much
simplified cousin of the way in which our nervous system works to extract the
common elements of a sound. A sound is tape-recorded over a suitably chosen
interval of time; this tape is then made into a loop and played over and over
into an electronic analyzer that picks out successive frequency bands (say 50
or 100 Hz wide) and measures the aggregate strength of the partials lying
within them, averaging the results of each measurement over the entire duration
of the passage. If we wish to apply this procedure to a singer's voice, the
recording must be long enough that the singer has had time for several
repetitions of a substantial fraction of his voice's repertory of pitches and vowels. Under these conditions, the long time
average spectrum (abbreviated LTAS) gives us something that is a close cousin
to the larynx spectrum as modified by the "treble boost" property of
the mouth-to-room coupling. The peaks and dips of the vocal tract transmission
for various enunciations tend to average themselves out when various pitches
are sung in an LTAS, leaving evidence of their statistical aggregate in the
form of a somewhat accentuated region near 450 Hz, analogous to the
mouth-aperture trend toward accentuated high frequencies that was just mentioned.
(While the LTAS technique has many uses in the study of musical sounds, one
cannot use it trivially to deduce such things as the flow spectrum at the reed
end of a woodwind, or the force spectrum at the bowing point of a violin,
despite their apparent analogy with the excitation spectrum from the larynx.)
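
The band-by-band averaging can be sketched in software. The version below is a deliberately simplified stand-in for the tape-loop analyzer: instead of working on a recorded waveform, it assumes that each short stretch of the passage has already been reduced to a list of partials (frequency and amplitude), and it takes the "aggregate strength" in a band to be the summed squared amplitude. Both of these are assumptions made for illustration, as are the function names and the sample data.

```python
from collections import defaultdict

BAND_WIDTH_HZ = 100  # the text suggests bands 50 or 100 Hz wide


def band_index(frequency_hz, band_width_hz=BAND_WIDTH_HZ):
    """Number of the analysis band that contains frequency_hz."""
    return int(frequency_hz // band_width_hz)


def long_time_average_spectrum(frames, band_width_hz=BAND_WIDTH_HZ):
    """Average the power found in each band over all frames.

    `frames` is a list of snapshots; each snapshot is a list of
    (frequency_hz, amplitude) pairs for the partials present at that moment.
    Returns a dict mapping the lower edge of each band (Hz) to its mean power.
    """
    if not frames:
        return {}
    totals = defaultdict(float)
    for partials in frames:
        for frequency_hz, amplitude in partials:
            totals[band_index(frequency_hz, band_width_hz)] += amplitude ** 2
    return {
        index * band_width_hz: total / len(frames)
        for index, total in sorted(totals.items())
    }


# Hypothetical data: two sung notes, each reduced to three partials.
frames = [
    [(100, 1.0), (200, 0.5), (300, 0.25)],
    [(220, 1.0), (440, 0.5), (660, 0.25)],
]
ltas = long_time_average_spectrum(frames)
for band_start, power in ltas.items():
    print(band_start, "to", band_start + BAND_WIDTH_HZ, "Hz:", power)
```

With only two frames the result still shows the fingerprints of the individual notes; it is the averaging over many pitches and vowels that smooths these away.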

In the above digression and the immediately preceding paragraph we have considered
an aspect of vocal sounds whose description remains fairly constant even when
the singing pitch is altered. It proves possible to make statements of the
sort, "we learn from a certain singer's LTAS that the higher partials of
his voice become successively weaker at the rate of 12 dB/octave," without
having to specify the repetition rate of the source. The musical relevance of
this possibility comes at present from the fact that our hearing mechanism is
able to extricate an auditory version of this same information. It is time now
to look at the interplay between a constant element of a given vowel (its
formants) and the variations in pitch that are the basis of singing.
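
A statement such as "12 dB/octave" can be turned into numbers without any reference to the pitch being sung. The sketch below assumes the usual convention that a level difference in decibels equals 20 times the base-10 logarithm of the pressure-amplitude ratio; the function names are our own.

```python
import math


def partial_level_db(harmonic_number, slope_db_per_octave=12.0):
    """Level of a partial relative to the fundamental, in dB."""
    return -slope_db_per_octave * math.log2(harmonic_number)


def amplitude_ratio(level_db):
    """Convert a level difference in dB to a pressure-amplitude ratio."""
    return 10.0 ** (level_db / 20.0)


# The same relative levels apply whatever the repetition rate of the source.
for fundamental_hz in (100.0, 220.0):
    for n in (2, 4, 8):
        level = partial_level_db(n)
        print(fundamental_hz, n, n * fundamental_hz,
              round(level, 1), round(amplitude_ratio(level), 3))
```

Partials 2, 4, and 8 lie one, two, and three octaves above the fundamental, so they come out 12, 24, and 36 dB weaker for either choice of fundamental.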

The fact that the upper two-thirds of the bass-baritone singing instrument's range
overlaps the lower third of the 150-to-850-Hz range of the first voice formant
guarantees the impossibility of specifying the amplitude relation between
successive partials measured in the room without also specifying the singing
pitch. Thus we deduce from figure 19.6 that the 700-Hz first formant for the
vowel sound [ah] lies three octaves plus about a semitone above the 80-Hz
bottom note singable by a typical male voice, so that the strongest partials of
the E2 note (as we hear it) will be the 8th and 9th. On the other hand, the top
note of our hypothetical male singing instrument has a 350-Hz fundamental
frequency, so that when it sounds [ah] while singing F4, this same first
formant will cause the 2nd partial to come to our ears most strongly.
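
These relationships are quickly verified. The sketch below uses the rounded 80-Hz and 350-Hz fundamentals of the text, and also the equal-tempered value of 82.4 Hz for E2 (a figure supplied here, not in the text) to show why the interval up to 700 Hz is described as three octaves plus about a semitone. The function names are our own.

```python
import math

FIRST_FORMANT_AH_HZ = 700.0


def interval_above(lower_hz, upper_hz):
    """Return (whole octaves, leftover semitones) from lower_hz up to upper_hz."""
    semitones = 12.0 * math.log2(upper_hz / lower_hz)
    octaves = int(semitones // 12)
    return octaves, semitones - 12 * octaves


def partials_straddling(fundamental_hz, target_hz):
    """Harmonic numbers lying just below and just above target_hz."""
    below = int(target_hz // fundamental_hz)
    return below, below + 1


def nearest_partial(fundamental_hz, target_hz):
    """Harmonic number whose frequency lies closest to target_hz."""
    return max(1, round(target_hz / fundamental_hz))


print("bottom note:", partials_straddling(80.0, FIRST_FORMANT_AH_HZ))
print("top note:", nearest_partial(350.0, FIRST_FORMANT_AH_HZ))

octaves, leftover = interval_above(82.4, FIRST_FORMANT_AH_HZ)
print("E2 to 700 Hz:", octaves, "octaves plus", round(leftover, 1), "semitones")
```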

We,
as listeners experienced with human speech, would have no difficulty in
recognizing the vowel [ah] as produced by our singer at either of the
above-mentioned pitch extremes. On the other hand, as musicians interested in
tone color who imagine ourselves to be listening to abstract sounds, we might
not be willing to say that the singer produces the same tone color when he
sings [ah] at the bottom of his range as he does at the top of it. Let us
sharpen up the contrast between the musical and verbal versions of our
perceptions with the help of an example mentioned early in section 19.1.
Suppose we tape record the sound of a singer producing the sustained vowel [ee]
at the pitch C4, and then play this tape back at various speeds so as to
transpose the tone to all the semitones of the musical scale. In this process
the formant frequencies (peaks in the spectrum envelope) are transposed to
higher and lower frequencies, along with the partials of the tone itself. An
engineer would say that the spectra of the resulting tones all have the same
shape, and he could deduce from the bottom curve in figure 19.6 that the
fundamental component (which was originally at 261.6 Hz) is more than 3 times
as loud as partial 2 and about 15 times louder than partial 3 (lying near 784
Hz); partial 6 is almost inaudible since it lies at the dip in the formant
curve (near 1570 Hz), while partials 8 and 9 straddle the second formant peak
and so are about as strong as partial 3. If this description omits the
frequency designations (which were purely explanatory) leaving only the serial
numbers of the various partials and their relative strengths, the above
statements remain true for the entire scale of transposed notes, as already
noted by the engineer.

As long as we do not wander more than an octave or two on our scale above or below
C4, our musical ears would agree with the engineer's description given above in
the sense that they would recognize that all these [ee] sounds have a rather
constant tone color. At a subtler level of listening we would detect a slow trend
toward what many people would call brightness or lightness in this sort of
sound as we go up the scale, and a corresponding darkening as we go down. This
description of relative lightness or darkness, however, is not associated with
quite the same sort of acoustical change that we find associated with these
adjectives when a singer changes the excitation recipe from his larynx.

If we change our mode of listening to that used in recognizing human speech, we
find, on the other hand, that the sound of our tape playbacks would not preserve the [ee] vowel character
very far as we go up or down in the scale from the C4 starting point. This is
because in playback the formant frequencies themselves are being shifted, thus
destroying the identifying marks of the vowel. To be sure, no trouble at all
comes from going up or down the scale by a major third because this leaves us
within the 25-percent range spanned by the average formant frequencies of men,
women, and children. Experiment shows, however, that a 50-percent shift of the
formants by the transposition of our tones up or down by a musical fifth will
change speech sounds enough to hinder intelligibility seriously.
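
The percentages quoted here follow from the equal-tempered semitone ratio. The sketch below assumes equal temperament (so that a major third is 4 semitones and a fifth is 7) and uses function names of our own choosing. It also confirms the engineer's point that a change of tape speed leaves the ratios between partials untouched while moving every frequency, formants included, by the same factor.

```python
def speed_ratio(semitones):
    """Playback-speed ratio that transposes a recording by `semitones`."""
    return 2.0 ** (semitones / 12.0)


def percent_shift(semitones):
    """Percentage by which every frequency (formants included) is moved."""
    return (speed_ratio(semitones) - 1.0) * 100.0


def transpose(frequencies_hz, semitones):
    """Apply the same speed ratio to every frequency in the list."""
    ratio = speed_ratio(semitones)
    return [f * ratio for f in frequencies_hz]


c4_partials = [261.6 * n for n in (1, 2, 3)]  # C4 and its next two partials
up_a_fifth = transpose(c4_partials, 7)

print("major third:", round(percent_shift(4), 1), "percent")
print("fifth:", round(percent_shift(7), 1), "percent")
print("partial 3 / partial 1 after transposing:",
      round(up_a_fifth[2] / up_a_fifth[0], 6))
```

An upward major third thus moves the formants by very nearly the 25 percent that separates men's from children's voices, while a fifth moves them by almost exactly 50 percent.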

Opera singers and others who perform with large orchestral accompaniment
have developed several very interesting ways of coping with the problem of
being heard recognizably. While parts of the two phenomena I shall describe
here have been recognized for several decades, our understanding of their
implications has been clarified greatly by the recent work of Johan Sundberg at
the Speech Transmission Laboratory in Stockholm. [5]

Let us first investigate the acoustical nature of the
singer-versus-orchestra audibility problem, so that its solution can be made
intelligible. To begin with, we must be aware that the shape of the long time
average sound pressure spectrum (LTAS, see the previous digression in this
section) of orchestral music is very much the same, whether one measures a
Mozart violin concerto or an operatic overture by Wagner. There are of course
small differences, and loud passages in particular have an LTAS with a slight
increase of their high-frequency components relative to the low-frequency ones.
We can describe the sound pressure level (decibel) version of the orchestral
LTAS by saying that it rises quickly from low frequencies to a peak near 450
Hz, and then falls away with an average slope of about 9 dB/octave. The actual
measured spectrum can be translated into the corresponding loudness curve,
which gives at any frequency the loudness that that particular segment of the
spectrum would have if it were heard by itself. We find here that the peak at
450 Hz has now become very marked indeed, falling to half loudness on the two
sides of the peak at about 150 and 900 Hz. The loudness is roughly constant
from 1000 Hz to about 2500 Hz, above which it decreases steadily to nothing at
the upper limit of hearing. A typical example of this behavior is shown by the
smooth curve in figure 19.7.

The LTAS for
ordinary speech and ordinary singing (but not for singing in the large-scale,
operatic style) has a shape that is roughly similar to what we have just
described as belonging to an orchestra. This remark provides us with at least
an indication that a singer might have problems being heard; he apparently does
not sound very different (in one sense) from the orchestra, and it is unlikely
that he can overpower it through sheer vocal exertion. If the LTAS of an
orchestra and an ordinary voice are quite similar, we would expect a certain
amount of masking to take place (see chap. 13). When one listens in a room to
pairs of sinusoids, fluctuations in
the transmission of both the masking and the masked sound from source to our
ears normally make masking unimportant. However, when there are many components from various sources
having frequencies within the ear's critical bandwidth (about four semitones)
centered around the frequency of the test sinusoid, masking can be a problem.
Sundberg has found in a preliminary way that, for a single sinusoid to be audible
in the presence of a noise source whose spectrum has been given the same shape
as the orchestral LTAS, this single sinusoid must have a pressure amplitude
roughly equal to the aggregate masking sound pressure of the noise that is
within the critical bandwidth surrounding the sinusoid. In a musical
surrounding one might expect the highly organized harmonic components of the
singer's voice to survive masking somewhat better than this, because they can
advertise themselves quite well as a single entity: that is, they have exactly
synchronized beginnings and endings, precisely tracking vibratos, and
well-defined patterns of swelling and diminishing as the formants change during
articulation. All these things prove to be somewhat effective, particularly
since many of these patterns of change are quite different from those that help
characterize the various orchestral instruments. Nevertheless, the sheer weight
of numbers leads to trouble when one man tries to make himself heard above the
sounds from many. Furthermore, the overall similarity of the orchestral LTAS
and its ordinary vocal counterpart
guarantees that at no place in the frequency range do the voice partials have a
chance to predominate over their orchestral setting, and so "carry"
their weaker brothers to our attention.
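
Sundberg's preliminary criterion can be put into a rough computational form. The sketch below makes several assumptions of our own: the critical band is taken as exactly four semitones wide and centered geometrically on the test frequency, the masking components inside it are combined by adding their powers (as is appropriate for unrelated sources), and the "roughly equal" condition is treated as a sharp threshold. The masker list is invented for illustration.

```python
import math

CRITICAL_BAND_SEMITONES = 4.0  # "about four semitones" in the text


def critical_band_edges(center_hz, width_semitones=CRITICAL_BAND_SEMITONES):
    """Lower and upper edges (Hz) of a band centered on center_hz."""
    half = 2.0 ** (width_semitones / 24.0)  # half the width, as a ratio
    return center_hz / half, center_hz * half


def aggregate_pressure(masker_components, center_hz):
    """Combine the maskers inside the band, assuming their powers add."""
    low, high = critical_band_edges(center_hz)
    return math.sqrt(sum(p ** 2 for f, p in masker_components
                         if low <= f <= high))


def is_audible(test_pressure, masker_components, center_hz):
    """Apply the rough criterion: test tone at least as strong as the maskers."""
    return test_pressure >= aggregate_pressure(masker_components, center_hz)


# Hypothetical masking components: (frequency in Hz, pressure amplitude).
maskers = [(900.0, 3.0), (1000.0, 4.0), (1300.0, 10.0)]

print("band edges:", [round(e, 1) for e in critical_band_edges(1000.0)])
print("aggregate masking pressure:", aggregate_pressure(maskers, 1000.0))
print("audible at 5.5:", is_audible(5.5, maskers, 1000.0))
print("audible at 2.0:", is_audible(2.0, maskers, 1000.0))
```

Note that the strong component at 1300 Hz does not count against a 1000-Hz test tone, because it lies outside the critical band.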

The
first of the acoustical alterations cultivated by the operatic singer to help
him in the audibility contest is his habit of singing with a vocal cord
placement and lung pressure relationship that produce short, sharp puffs of
air in the output of his larynx. By this means he can, as we have seen
earlier, strengthen the upper partials in his voice. The increased audibility
of these upper partials helps us to follow the rest of his voice components
through the orchestral sound.

The second large-voice acoustical phenomenon we will consider is the so-called singer's formant. At least 25 years ago
it was noticed that skilled male operatic singers did not sing words with quite
the arrangement of formants that they would use in speaking those same words.
Many of these differences are relatively small, and for present purposes
unimportant. However, there is one very significant alteration that turns out
to contribute enormously to the audibility of a singer who competes with an
orchestra. Tucked in among the other formants of his voice is a very strongly
marked extra one lying somewhere in
the region between 2500 and 3000 Hz. When we measure the various speech sounds
one by one in an operatic singer's voice, we find that this particular formant
has a frequency that is independent of the placement of the other, more
ordinary formants. The enormous contribution of the singer's formant to his
audibility can readily be understood by comparing the loudness LTAS for
ordinary music (solid line in fig. 19.7) with the one obtained by Sundberg for
the tenor Jussi Björling singing with loud orchestral accompaniment, which is
shown as the beaded line in figure 19.7.

The
fact that the singer's formant is independent of the placement of the other
formants tells us that this formant arises from resonances in some part of the
vocal tract that somehow escapes the influence of the ordinary changes in its
shape. We can make good use of the ideas of wave impedance (which were first
met in section 17.1) to help ourselves find the origin of the singer's
formant. The vocal cords form an adjustable closure at the bottom of a small
tube (the larynx tube) which is a little more than 2 cm long. The larynx tube
has a slight bulge at its lower end, and its upper end opens into a somewhat
enlarged throat region which then connects with mouth and nose cavities. The
operatic singer has learned to exaggerate the change in cross-section that
exists at the junction of the larynx tube and the throat, thus increasing the
discontinuity of wave impedance between the two ducts. The second digression in
section 17.1 explains that if two parts of a large system have drastically
different wave impedances, it is permissible to think about the characteristic
frequencies of each part more or less independently. Sundberg has shown that
the first characteristic mode of vibration of air in the short larynx tube is
associated with the singer's formant. The excitation in the short tube is given
its acoustical identity by the trained singer's ability to provide a strong
discontinuity in the cross-section at its upper end. If the discontinuity is not
emphasized, the larynx tube is merely part of the "room" of irregular
shape called the vocal tract. If we like, the operatic singer's larynx tube
can be thought of as a miniature vocal tract in its own right, whose upper end
serves as a kind of mouth which excites the long narrow room provided by the
rest of the vocal tract. In this way of looking at things, the singer's formant
is the first formant of the miniature vocal tract. In other words, the
oscillatory flow recipe from the larynx is first given, in the short tube of
the larynx, a strongly peaked boost in the 2500-to-3000-Hz region before it is
passed on for a more familiar type of processing by the rest of the vocal
system.

To
summarize, the trained operatic male voice is produced by a singer who has
learned to cope with his orchestral accompaniment by means of several changes
in his acoustical output. First of all, he can generate a flow pattern from his
larynx whose higher partials become progressively weaker at a more gradual
rate than those used in ordinary speech or in a smaller-scale type of singing.
In addition, he has learned (sometimes at the expense of a certain amount of
strain, or even discomfort) to pull the lower end of his vocal tract into a
shape that permits the production of the singer's formant. Finally, he tends
to use a fair amount of vibrato, which adds a great deal of recognizability to
the various sinusoidal components of his voice by providing them with a synchronized
pulsation in frequency and amplitude (as they sweep across their various
formants). Such synchronized variations in an otherwise complex signal are of
course exactly the sort of things our auditory recognition machine works well
upon. The synchronized pulsations of vibrato are one more common element in the
singer's sound which we can seize upon as our ears pursue his voice through the
music.

The
special skills of the male operatic singer have, as we have seen, a particular
value to him in his chosen profession, but they are not an entirely unmixed
blessing. The singer's formant, whose frequency is essentially unchangeable,
can become a harsh and obtrusive element sawing away on the listener's
consciousness. This harshness can be avoided to some degree if the performer
is artistic enough to vary his singer's formant from nothing on up to its
maximum prominence, changing its magnitude as his musical surroundings
change. Similarly, his customary form of vibrato, which runs continually and at
its own pace completely independent of the rhythmic pattern of the music, can
give great audibility to his voice precisely because of the individuality of
its pattern. However, any piece of music is likely to require a resourceful
musician to employ once again the full range of variation, from no vibrato at
all, through one which comes and goes during the longer notes, to the more
fixed variety whose function we have already described. In short, maximum
audibility is not automatically advantageous-a voice whose rich variability
is skillfully made to appear and disappear in various ways provides a marvelous
vehicle for the display of true artistry.

19.5. Formant Tuning and the Soprano Singing Voice

The soprano
singer uses tones from the upper portion of the range for human voices. The
relationships between a soprano's relatively high voice frequencies and those
of the formants she uses for speech will help us understand several of her
practices that are quite different from those of her male colleague. A particularly
striking practice of some sopranos will be the subject of this section.

One
evening in the fall of 1971 my wife and I noticed an arresting and most
attractive quality in the sounds we heard in a recording by the soprano Teresa
Stich-Randall as she performed the aria Porgi amor from Mozart's opera The
Marriage of Figaro. [7] Whenever a note of the aria persisted a little, she seemed to
be "tuning" one or another of the vowel formants to a harmonic
component of the voice spectrum. It did not seem possible for her to start each
note with this formant matching already complete, but the adjustment would
take place rather quickly, making the tone "bloom" in a most pleasing
way. Enquiry among singers shows that this mode of singing is not in general
consciously cultivated. As a matter of fact, only a few singers do it with the
precision that first brought it to our attention. Many listeners also seem to
find it difficult at first to focus their attention on these acoustical
changes, though most will say they find the resulting tone color admirable. It
was easy for me to recognize this soprano's tuning process, since it was simply
a new example of what I am accustomed to listen for as I alter the resonances
of musical instruments by shading a woodwind tone hole with my finger or by
moving an object in and out of the bell of a woodwind or a brass instrument.
Such effects are important when I am asked to work on an instrument, because
they act as a guide to more permanent adjustments to its physical structure.

To help us see what is going on when a singer tunes her formants in the
way we noticed on the recording, we will look at a specific example. We will
suppose that our soprano, while singing a word having the vowel sound [oo],
comes to rest on the note B4b in the middle of the treble staff. At this point
she is producing a tone made up of harmonic partials whose frequencies are
466.2, 932.3, 1398.5, . . . Hz. Clearly, the first partial of her voice lies
somewhat above the 350-Hz position we would expect her to give formant 1 (17
percent above the 300-Hz value shown in the top part of fig. 19.6 for a male
voice). The singer alters her tongue, jaw, and lip positions a little bit from
her normal way of producing the [oo] sound, in such a way as to raise this
formant to match the fundamental component of her voice sound. Our meticulous
singer is next called on to sing a word having the vowel sound [ah] while
producing the note D5, whose frequency components lie at 587.3, 1174.7, 1762.0,
. . . Hz. While sustaining her note she can make a small downward adjustment
in the frequency of her 1287-Hz second formant to make it coincide with the
second voice partial.
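
The two cases just described can be reproduced numerically. The sketch below takes the male formant values used earlier in the chapter (300 Hz for the first formant of [oo], 1100 Hz for the second formant of [ah]), raises them by 17 percent for a woman's voice, and then finds the nearest voice partial and the adjustment needed to reach it. The function name and the nearest-partial rule are our own simplifications; the singer's actual choice presumably depends on more than mere proximity.

```python
WOMEN_FORMANT_FACTOR = 1.17  # "about 17 percent higher" than male values


def tuning_move(formant_hz, fundamental_hz):
    """Find the nearest voice partial and the shift (Hz) needed to reach it."""
    number = max(1, round(formant_hz / fundamental_hz))
    target = number * fundamental_hz
    return number, target - formant_hz


oo_formant_1 = 300.0 * WOMEN_FORMANT_FACTOR   # about 351 Hz
ah_formant_2 = 1100.0 * WOMEN_FORMANT_FACTOR  # about 1287 Hz

# Case 1: [oo] sung on the 466.2-Hz note.
partial, shift = tuning_move(oo_formant_1, 466.2)
print("[oo] formant 1 -> partial", partial, "shift", round(shift, 1), "Hz")

# Case 2: [ah] sung on the 587.3-Hz note.
partial, shift = tuning_move(ah_formant_2, 587.3)
print("[ah] formant 2 -> partial", partial, "shift", round(shift, 1), "Hz")
```

A positive shift means the formant must be raised to meet the partial, and a negative one means it must be lowered, in agreement with the directions of adjustment described above.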

In
1972 Johan Sundberg made a set of observations on the way a professional soprano
placed her formants while singing various vowels. He found that singers tend to
align their formant frequencies in approximately the way just described, although
his experimental subject did not align her formant tunings as closely as do
certain singers whom I have noticed. However, the general behavior observed by
Sundberg is entirely consistent with the possibility of exact tuning.

Figure 19.8 shows the kind of things that a soprano can do if she wishes (and is able)
to make close tunings of her own voice formants to the voice frequencies required
by the musical circumstances. Marks for the chromatic scale notes between C4
and A5# are arranged along the bottom axis of the figure, along with an
indication for the fundamental frequencies belonging to these notes. The
vertical axis is marked off with a frequency scale to indicate the
voice-partial frequencies, and those of various formants. The solid-line
curves that rise toward the right show the trend of the fundamental frequency
and of its harmonics as one sings up the scale. Each curve is numbered at its
left-hand end to indicate the harmonic to which it refers.

The sequence of dots along the lowest part of the graph shows the way in which the
frequency of formant one varies if one sings either [oo] or [ee] up the
chromatic scale between C4 and A5#.
This formant frequency is about 350 Hz for all notes below D;, and therefore is
not close to any of the voice harmonics. When the singer gets to E4, formant
one for these two vowels has a frequency that matches that of her voice
fundamental. As she sings further up the scale, she opens her mouth
progressively wider, moves her jaw, etc., to keep formant one in tune with
partial 1, even though their frequency rises from 329.6 Hz all the way up to
932.3 Hz. In other words, over a great part of her singing range a soprano is
able to strengthen partial 1 by letting it ride on the peak of the first
formant of either [oo] or [ee].
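The two frequencies quoted above (329.6 Hz at E4 and 932.3 Hz at A5#) can be checked with a short sketch. It assumes equal temperament with A4 at 440 Hz, which the text does not state explicitly, and the function names are hypothetical. The rule that formant one rests at 350 Hz and then rides exactly on partial 1 from E4 upward is an idealization of the behavior described, not measured data.

```python
A4_HZ = 440.0  # assumed reference pitch; the text does not state one
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_frequency(name, octave):
    """Equal-tempered frequency in Hz of a note such as ("E", 4)."""
    semitones_from_a4 = (NOTE_NAMES.index(name) - NOTE_NAMES.index("A")
                         + 12 * (octave - 4))
    return A4_HZ * 2.0 ** (semitones_from_a4 / 12.0)


def tuned_first_formant(fundamental_hz, resting_hz=350.0):
    """Idealized formant one for [oo] or [ee]: it rests near 350 Hz and
    rides on partial 1 from E4 upward (a simplification of the text)."""
    if fundamental_hz >= note_frequency("E", 4):
        return fundamental_hz
    return resting_hz


if __name__ == "__main__":
    for name, octave in [("C", 4), ("E", 4), ("A", 4), ("A#", 5)]:
        f0 = note_frequency(name, octave)
        print(f"{name}{octave}: fundamental {f0:6.1f} Hz, "
              f"formant one {tuned_first_formant(f0):6.1f} Hz")
```

The note spelled A5# in the text appears here as ("A#", 5); only the naming differs.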

The next progression of dots above the one we have just discussed shows what
happens semitone-by-semitone to formant one of the vowel [ah] as our fine-tuned
singer progresses up the scale. Below E4, this formant cannot be brought into
tune with a voice partial. From E4 to about G4 it is possible for vocal-tract
adjustments to be made matching formant one with the frequency of the singer's
second partial. Above this point in the scale, there is no reasonable way to
bring the first formant belonging to [ah] into resonance until we come to E5.
Beyond this the voice fundamental has risen sufficiently that it can be used to
guide the matching of the first formant of [ah] as well as those belonging to
the [oo] and [ee] sounds recognized earlier.

Just above the dots showing the first formant behavior of [ah] we find a similar
sequence for the variation of formant two belonging to [oo]. This formant can
come under the control of the second voice harmonic from about A4# all the
way to the top of the range. Notice that above A4# the singer has the
possibility of keeping both fundamental and second harmonic of her singing pitch in tune
with formants of [oo]. Whether she does this, or picks one or the other, or
tunes neither to the formants of [oo] presumably would depend on her skill and
also on the time available. There is also the possibility that for some singing
pitches it is not physiologically possible to attain both matchings
simultaneously.

The second formant of [ah] jogs along in the general neighborhood of 1200 Hz over
the whole singing range, although it becomes a candidate for tuning below D4#
and in the immediate neighborhoods of
G4 and D5. Sundberg found no evidence for an attempt at tuning the second
formant of [ee], as indicated by the gently sloping row of dots at the top of
the diagram. He finds this same lack of influence of upper partials on the tuning
for the second formants of two or three other vowels, all of which lie very
close to that shown for [ee]. This observed lack of influence of the higher
partials is consistent with my own experience in the adjustment of wind
instruments. If one can get two or three air column resonances accurately
lined up with the lower partials of the sound spectrum, the listener and the
player are very pleased with the result. Evidence in support of this observation
can be traced in instrument making and performance practice at least back to
1720.

Let us ask now what musical resources are made available to a singer who
can tune one or two of her vowel formants to match at least approximately the
harmonic components of the note she is producing. Sundberg points out that
the most obvious advantage that comes from even an approximate tuning of the
first formant is a very large increase in the loudness of the sound a singer
can achieve for a given vocal effort. Not only will this be of use when she
must compete with strong accompaniments, but also in more normal musical
surroundings it has the advantage of increasing the range of dynamics that she
can produce between a just-audible pianissimo and the fortissimo level that
corresponds to the maximum effort of which she is capable.

There
is a subtler effect of considerable musical importance which can be noticed
when there is exact tuning of any formant. We learned earlier in this chapter
that the inherent unsteadiness of the vocal cord oscillations gives rise to
minute fluctuations in both amplitude and frequency of the sinusoidal
components of the airflow recipe. In the closing part of section 19.3 we
noticed that fluctuations in the frequency of a voice partial located on the
sloping side of a formant peak give rise to fluctuations in the amplitude of
the component as it is given to the room. In other words, there is more
amplitude unsteadiness to be detected in the radiated sound than is present in
the original excitation recipe from the larynx. When, however, the voice
partial finds itself perched at the rounded top of a formant peak, the
frequency fluctuations no longer give rise to additional amplitude variations,
and the tone takes on a particular smoothness and fullness. Once again it
should be remarked that my first awareness of the perceptual importance of an
altered relationship between the two kinds of source unsteadiness came from
study of the analogous behavior of orchestral wind instruments. This also led
to the development of a simple but highly precise method for the measurement of
air column resonance frequencies.

Whether
or not a singer tunes a formant precisely to a voice partial, we recognize
that her use of vibrato will have a very marked effect on the overall tone. The
vibrato is of course a smoothly varying fluctuation in frequency which varies
almost sinusoidally half a dozen times per second. This makes for a
corresponding variation in the loudness of any partial that lies on the side of
a formant peak. If the vibrato centers itself to vary equally on either side of
a formant peak, the loudness drops briefly twice per cycle of the vibrato, as
its excitation frequency slides down alternately on the two sides of the
formant peak.
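The twice-per-cycle dip can be illustrated numerically. The sketch below assumes a single resonance peak with a Lorentzian-like shape and purely sinusoidal vibrato; the peak shape, the 100 Hz bandwidth, and the 20 Hz vibrato depth are illustrative assumptions rather than values from the text, and the function names are hypothetical.

```python
import math


def formant_gain(freq_hz, center_hz, bandwidth_hz):
    """Relative gain of one resonance peak (assumed Lorentzian-like shape)."""
    detuning = (freq_hz - center_hz) / (bandwidth_hz / 2.0)
    return 1.0 / math.sqrt(1.0 + detuning * detuning)


def loudness_dips_per_cycle(partial_hz, depth_hz, center_hz, bandwidth_hz,
                            steps=600):
    """Count the minima of the radiated amplitude over one vibrato cycle."""
    gains = []
    for i in range(steps):
        phase = 2.0 * math.pi * i / steps
        freq = partial_hz + depth_hz * math.sin(phase)  # sinusoidal vibrato
        gains.append(formant_gain(freq, center_hz, bandwidth_hz))
    dips = 0
    for i in range(steps):
        # Treat the cycle as periodic: index -1 wraps to the last sample.
        before, here, after = gains[i - 1], gains[i], gains[(i + 1) % steps]
        if here < before and here <= after:
            dips += 1
    return dips


if __name__ == "__main__":
    # Partial centered on the formant peak: amplitude falls on both sides.
    print(loudness_dips_per_cycle(700.0, 20.0, 700.0, 100.0))
    # Partial on the lower slope of the peak: one rise and fall per cycle.
    print(loudness_dips_per_cycle(600.0, 20.0, 700.0, 100.0))
```

With the partial centered on the peak the count is two per vibrato cycle; with the partial on one slope the amplitude simply follows the frequency, giving one dip per cycle.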

19.6. Intermediate Voices and Various Musical Implications

You will
perhaps be wondering by now whether the male singer tunes formants to the
harmonic partials of his voice after the manner of the soprano, and you may
also be curious to know whether she borrows his custom of generating a singer's
formant. The answers to these questions lead us toward an understanding of the
ways in which tenors and altos cope with the musical demands made on their
voices, which lie acoustically in the region between the high and low voices
we have been studying.

Because
the male voice has formant peaks whose widths are comparable to the distance
between its closely spaced harmonics (see the top part of fig. 19.5), very
little change in the loudness of such a voice would be expected when formant
tuning takes place. The loudness contributed by a pair of partials that
straddle a formant peak is not very different from that produced when one of
these lies exactly on the peak while the other one is displaced some distance
down along the shoulder. To be sure, we can expect to find in the low voice a
slight and rather pleasant change of tone color caused, in passing, by ordinary
vowel changes and by vibrato, as discussed in earlier sections.

The
soprano makes almost no use of the singer's formant that is an important
resource of the male singer. We have learned that her habit of formant tuning
already gives her a powerful weapon in the battle for audibility (quite aside
from its important aesthetic function). Thus she has no particular reason to
seek additional reinforcements. Sundberg finds in addition that the muscular
requirements that must be met to produce the singer's formant are sometimes
incompatible with the adjustments that many of these same muscles must make in
tuning the formants.

Singers
whose voices lie between the bass and the soprano are apt to borrow heavily
from the techniques used by their higher- and lower-pitched neighbors. Thus the
alto will frequently use the singer's formant. In the same way one gets more
than a hint of formant tuning when tenors and altos use the higher parts of
their registers, where the technique becomes acoustically more effective.

Most
singers, throughout their musical range, constantly (though usually unconsciously)
manipulate the vocal tract formants to place their frequencies at musically
useful spots. These modifications in formant frequencies provide the major explanation
for the difficulty we often have in understanding the words of a song. The
patterns we are accustomed to use for the identification of spoken words are
modified in music to meet other requirements. Often the words used in a
musical setting require a high degree of understandability (for instance, in
musical comedy, light opera, and lieder singing).
In this type of music the singer and the composer both face an extremely
difficult challenge, quite aside from the question of competition with an
accompaniment, since both must constantly work toward getting the right word
sounds together with the right pitches.

Before
we leave the singers for a study of other musical instruments, we should notice
one more feature of their tone production which is of considerable musical
importance. The inherent unsteadiness of the vocal cord motion produces, as we
have seen, a slight fluctuation in the amplitudes and frequencies of the
various voice partials, even when there is no deliberate vibrato. It is useful
to recast our description of the resulting sound by recognizing that each
unsteady partial is in fact a closely spaced clump of randomly arranged steady
sinusoids; the strongest members of these clumps have very nearly the nominal
frequency of the partial, with weaker components being spread over a narrow
surrounding region of frequency. For some voices, each of these narrow-band
clumps of sound is spread across a pitch range of about 15 cents; for others it
is as narrow as 5 cents. My own voice lies in the middle of this
classification.
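To get a feeling for what a spread of 5 or 15 cents means in hertz, the conversion can be sketched as follows. The 440 Hz partial is an arbitrary illustrative choice, the clump is assumed to be centered in pitch on its nominal frequency, and the helper names are hypothetical.

```python
import math


def cents_to_ratio(cents):
    """Frequency ratio corresponding to an interval in cents."""
    return 2.0 ** (cents / 1200.0)


def ratio_to_cents(ratio):
    """Interval in cents corresponding to a frequency ratio."""
    return 1200.0 * math.log2(ratio)


def clump_width_hz(center_hz, spread_cents):
    """Width in Hz of a clump spanning spread_cents, centered in pitch
    on center_hz (half the spread above, half below)."""
    half = cents_to_ratio(spread_cents / 2.0)
    return center_hz * (half - 1.0 / half)


if __name__ == "__main__":
    for spread in (5.0, 15.0):
        width = clump_width_hz(440.0, spread)
        print(f"{spread:4.1f} cents at 440 Hz is about {width:.2f} Hz wide")
    print(f"a 1 percent spread is about {ratio_to_cents(1.01):.1f} cents")
```

On these assumptions a 15-cent clump at 440 Hz is a little under 4 Hz wide, and a one percent spread corresponds to roughly 17 cents.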

We
have already learned in our study of the piano the useful consequences of
having multiple clumps of partials (see sec. 17.3). For singers the same consequences
are manifested, but in a broader and smoother way. The beat phenomenon (which
is so pronounced between pairs of sinusoids) is very little heard between the
sounds of two slightly mistuned clumps of partials. For this reason, then,
slight errors of tuning between two singers produce far less clashing and
roughness than would arise, for example, from similar errors in the tuning of
two electric organ tones whose partials are made up of single sinusoids.
Curiously enough, the slight smearing of the partials of a singer's tone does
not prevent the production of audible heterodyne components (see chap. 14). As
a matter of fact, the production of difference tones, as defined in the digression
in section 14.4, is particularly easy to demonstrate with the help of two
sopranos.

The
following example will show how the natural small fluctuations of the voice
affect the generation of heterodyne components. Suppose we feed two clumps of components,
P and Q, to a nonlinear device such as the human ear, P being centered at 300
Hz and Q at 450 Hz. Let us assume for the sake of numerical simplicity that in
both cases the smearing width of the clumps is one percent, so that in P the
components are spread over a range of 3 Hz, while in Q they extend over 4.5 Hz.
The simplest heterodyne components that are born of this pair of sounds are
clumps which are centered at the following frequencies:

2P = 600 Hz, 2Q = 900 Hz, (P + Q) = 750 Hz, (Q - P) = 150 Hz

The extent of
the smearing of the resulting partials at these various locations depends
jointly on the widths of the ancestral clumps and on the details of the
strengths of the partials which are distributed within them. The spread of the
heterodyne clumps at 600, 900, etc., Hz might be something like the following:
4.2, 6.4, 5.4, 5.4 Hz. In every case the width of a heterodyne clump is
somewhat broader than the widths of its ancestors.
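The center frequencies follow directly from the arithmetic of sums and differences. For the widths, one simple model (an assumption made here for illustration, not a rule stated in the text) is to treat the spreads of the two ancestors as independent and combine them in quadrature. That model happens to reproduce the illustrative figures quoted above, although the text notes that the real widths also depend on how strength is distributed within each clump.

```python
import math

# Ancestral clumps from the example: center frequency and width, in Hz.
P_CENTER, P_WIDTH = 300.0, 3.0   # one percent of 300 Hz
Q_CENTER, Q_WIDTH = 450.0, 4.5   # one percent of 450 Hz


def combined_width(width_a, width_b):
    """Quadrature (root-sum-square) combination of two spreads.
    This rule is an illustrative assumption, not taken from the text."""
    return math.hypot(width_a, width_b)


def heterodyne_clumps():
    """Return {label: (center_hz, width_hz)} for the simplest components."""
    return {
        "2P": (2 * P_CENTER, combined_width(P_WIDTH, P_WIDTH)),
        "2Q": (2 * Q_CENTER, combined_width(Q_WIDTH, Q_WIDTH)),
        "P+Q": (P_CENTER + Q_CENTER, combined_width(P_WIDTH, Q_WIDTH)),
        "Q-P": (Q_CENTER - P_CENTER, combined_width(P_WIDTH, Q_WIDTH)),
    }


if __name__ == "__main__":
    for label, (center, width) in heterodyne_clumps().items():
        print(f"{label:>4}: centered at {center:5.0f} Hz, "
              f"about {width:.1f} Hz wide")
```

Under this model every heterodyne clump is wider than either ancestor, in line with the closing remark of the paragraph above.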

If
you refer back to our investigation in section 14.4 of the special
relationships between musical sounds, it will be apparent that the broadening
of spectrum components into clumps by voice instabilities by no means destroys
these relationships. It does, however, remove the clear-cut, all-or-nothing
nature of the beat-free intervals, converting them into a sort of pastel
version. This gives the composer a range between consonance and dissonance as
he writes his chords, making many things musically possible that are not successful
when he writes for instruments whose tones are made up of strictly sinusoidal
(single-component) partials.

19.7. Examples, Experiments, and Questions

1.
Close your lips around one open end of a long piece of tubing with a 20-to-25-mm
diameter and sing a slowly rising glissando
from your bottom note. You will find certain sharply defined pitches at
which it is essentially impossible to produce any sound at all. Your vocal
cords will insist on jumping to either a higher or a lower frequency of
oscillation in a most unsettling and unfamiliar manner. For a pipe that is 150
cm long, a voice will act in this way at frequencies close to 90, 185, and 265
Hz (only the highest of these is likely to be reachable by a woman); if the
pipe is 100 cm long, the disruptions occur near 130 and 250 Hz; for a 50 cm
pipe, the effect takes place at a lowest frequency near 245 Hz.[9] You may wish
to verify that, as the piece of tubing is progressively shortened, its
disruptive effects become progressively weaker, and the frequencies at which
they occur rise above 1000 Hz, which carries the phenomenon out of the singing
range for most of us.

The
upsetting effects produced by a piece of pipe on the vocal cord oscillations
take place at very narrowly defined frequencies, between which nothing unusual
is noticed in the "feel" of the experimenter's larynx. Since the
effect disappears completely as the pipe is shortened, it was indeed correct
in section 19.1 to treat the vocal cords as a normally autonomous
self-oscillating system which is not itself much influenced by the varying
acoustical properties of the vocal tract to which it is coupled.

2.
Several experiments having to do with formants can be done with a piece of
hard-walled tubing about 15 cm long with a diameter large enough (50 mm or so)
to fit around your ear while you
press the pipe airtight against the side of your head. With the pipe in place,
listen to the rushing sound produced by its response to random noise in the
room as you progressively close off the open end by sliding the flat of your
hand across it. The resonances of the cavity impose on the room noise a
spectrum envelope having formantlike behavior, so that you hear something
like a progression of whispered vowel sounds. The lowest three formantlike
frequencies associated with this cavity will be close to the following values:

Wide open end: 520, 1560, 2600 Hz

Half the end area blocked: 412, 1357, 2425 Hz

Three-quarters blocked: 374, 1310, 2390 Hz

Nine-tenths blocked: 321, 1259, 2359 Hz

The last of these
will give you a rough imitation of an [oo] sound, even though the formants do
not coincide with those given in the top part of figure 19.6.
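The wide-open values in the list above are close to what the elementary formula for a pipe closed at one end (by the head) and open at the other would give. The sketch below assumes a sound speed of 343 m/s and an open-end correction of 0.6 times the pipe radius; both are textbook approximations supplied here, not figures from this chapter. The partly blocked cases are not modeled.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for room-temperature air


def closed_open_resonances(length_m, radius_m, count=3,
                           end_correction_factor=0.6):
    """Resonances (Hz) of a pipe closed at one end and open at the other.

    Uses odd multiples of the quarter-wave frequency with an assumed
    end correction of end_correction_factor * radius at the open end.
    """
    effective_length = length_m + end_correction_factor * radius_m
    quarter_wave = SPEED_OF_SOUND_M_S / (4.0 * effective_length)
    return [(2 * n - 1) * quarter_wave for n in range(1, count + 1)]


if __name__ == "__main__":
    # 15 cm pipe, 50 mm diameter, pressed airtight against the head.
    for freq in closed_open_resonances(0.15, 0.025):
        print(f"{freq:7.1f} Hz")
```

On these assumptions the three lowest resonances come out within a few hertz of 520, 1560, and 2600 Hz, in the ratio 1:3:5.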

3.
If you sing a vowel sound in the presence of a piano whose dampers are lifted,
many of the strings will be set into vibration. When your tone ceases, these
strings will be heard to give back a crude but often recognizable echo of your
vowel. This phenomenon can be exploited in many ways. For instance, you could
hold down only the key whose note name is the same as that of the tone you
sing, on the assumption that the various string modes will respond to your
sound. Why will this experiment work better if you simultaneously hold down
three keys, corresponding to the pitch of the note you are singing plus the
ones a semitone above and below? Numerous other combinations of selectively
damped or undamped strings will suggest themselves for your experimentation.

4.
Playing back various long-sustained vowels on a tape recorder at a speed
greater or less than that used in recording them can make quite startling
changes in what they sound like. For example, playing [ah] back at half speed
turns it into [oh] despite the fact that the first-formant frequencies for these two
vowels are in the ratio 0.77, while the second formant ratio is 0.7, and the
higher formants have ratios close to unity. The tape recorder running at half
speed of course produces a ratio of 0.5 for all frequencies. Do you expect that
a double-speed playback of [oh] will necessarily give an [ah]?

5.
Deep-sea divers must work under conditions in which the atmosphere they breathe
is under very high pressure. To prevent "the bends," this atmosphere
generally has helium gas mixed in with the oxygen that is necessary to sustain
life. In such an atmosphere, the speed of sound and hence the frequencies of
the voice formants are raised considerably. In contrast to this change, why
would you expect only a small change in the oscillation frequency of the
diver's vocal cords, and so also in the pitch of his voice? There is a
considerable disruption of the intelligibility of speech when diving, caused in
part by the changes listed above and in part because the production of
consonants is deranged through changes in the air viscosity and density. Taking
everything into account, would you expect greater disruption of speech intelligibility
for men or women divers? Would you expect the diver to have trouble
understanding what he hears over the telephone from his helper who is at the
water's surface?

6.
Sound spectrographs are immensely useful laboratory tools for displaying visually
the changing patterns of strong and weak partials in the sounds of human
speech. It is inherent in the nature of these devices that sufficient speed to
follow rapidly changing sounds is attained at the expense of an ability to
measure accurately the frequencies of the individual partials; a sound
spectrograph shows only the general outline of the behavior of the formants.

From comic
strips and television shows one sometimes gets the impression that prints
generated by the sound spectrograph can be used to identify criminals in the
same dependable way that is possible with fingerprints. You might find it
interesting to list for yourself a few of the important aspects of human speech
recognition which cannot be displayed by such a device. It turns out that the
most dependable identifications are made by expert human listeners who
supplement the evidence of their ears with several instruments, including the
spectrograph.

7. It is
sometimes possible to describe the tone quality of musical instruments by
telling what vowel their tone imitates (e.g., the [aw] sound attributed to the
English horn). This occasionally tempts people to draw the erroneous conclusion
that the spectrum of the instrument resembles that of the vowel. In the late
nineteenth and early twentieth centuries, studies of human speech were
generally able to uncover only the strongest formant (usually the first),
which led to a particularly trivial characterization of instrumental tone
color. A vivid example of the acoustical disparity between a musical sound and
its vocal imitation is the "whet" sound that was attributed (in part 5 of
section 17.7) to the sound of brushed-across piano strings. When one enunciates
this word, the first formant starts near 300 Hz, rises steadily to about 700
Hz, and then falls to 250 Hz. The second formant meanwhile starts at 650 Hz,
rises above 1000 Hz, and then dips to 900 Hz before rising fairly smoothly to
2250 Hz. The third formant has a slowly rising trend from 2500 Hz to about 3200
Hz. Meanwhile, the sound spectrum of the stroked upper strings of a piano has a
fundamental component that steadily rises from about 2100 Hz at C7 to about
4200 Hz at C8, while the second harmonic covers a similar variation at double
frequency, ending up at 8400 Hz. It would be interesting to know how our
nervous system operates on such complexities to give us impressions of speechlike
sounds when we listen to musical instruments.

Notes

1.
An introduction to the speech process is to be found in the paperback book of this
title: Peter B. Denes and Elliot N. Pinson, The
Speech Chain (Garden City: Doubleday Anchor Books, 1973). Another
introductory paperback is that of Peter Ladefoged, Elements of Acoustic Phonetics
(Chicago: University of Chicago Press, 1962).

3.
See, for example, James L. Flanagan, Speech
Analysis, Synthesis and Perception, 2d ed. (New York: Springer-Verlag,
1972), pp. 49, 233, and 250. This book is one of the basic sources of information
today about the mechanisms of human speech. See also Gunnar Fant, Acoustic Theory
of Speech Production (The Hague: Mouton, 1970), p. 271. This is the other major
reference book on the speech process.

4. J. L. Flanagan, "Voices of Men and Machines," J. Acoust. Soc. Am. 51 (1972): 1375-87,
and James L. Flanagan, "The Synthesis of Speech," Scientific American, February 1972, pp.
48-58. The curves in figures 19.5 and 19.6 are calculated on the basis of data
found in Fant, Acoustic Theory of Speech
Production, pp. 109, 110, and 126. See also Flanagan, Speech Analysis, pp. 276-82.