a) Present address: Research Institute for Language
and Speech, University of Utrecht, Utrecht, The Netherlands

b) Also: Institute for Perception TNO, Soesterberg,
The Netherlands

Abstract

The perception of timbre differences in a vowel sung
by 8 male and 7 female singers has been investigated by means of two types of
listening experiments: (1) using a similarity-comparison paradigm, and (2) using judgments
on 21 semantic bipolar scales. Using INDSCAL analysis for the similarity-comparison data
and MDPREF analysis for the semantic-scale judgments, vowel configurations in a
multidimensional perceptual space were derived, as well as a space showing the weighting
of the perceptual dimensions by individual listeners (INDSCAL). The interpretation of the
semantic scales was represented by directions in the perceptual space (MDPREF). The
perceptual vowel configurations, whether based on timbre similarities or on semantic-scale
judgments, were comparable. Broadly, the semantic scales clustered into the categories
vocal technique, general evaluation, vibrato, clarity, and sharpness. These five clusters
were not independent and could be described in two dimensions. Timbre differences could be
predicted on the basis of differences in the 1/3-octave spectra of the vowels. It turned
out that only sharpness had a constant interpretation across the various stimulus sets and
was roughly related to the slope of the spectrum. One experiment, using a song phrase,
extended the results to a more general domain.

Introduction

In phonetic research, the perception of singing has
been given relatively little attention. In an excellent review, Sundberg (1982) commented
that experimental data in this field mainly focus on vowel intelligibility as a function
of fundamental frequency, recognition of vocal registers, perceptual determinants of voice
classification, and the effect of vibrato on perceived pitch. In many of these perceptual
studies, listeners have been shown to use timbre in distinguishing between vowels,
registers, or singer voice types. Timbre is used, for instance, as the perceptual quality
which allows one to judge whether a phonation has been sung in the falsetto or modal
register. However, the perception of timbre itself has hardly been investigated explicitly
in singing. Yet the importance of timbre cannot be overestimated: most of the vocabulary
in voice pedagogy for the description of voice quality is related to timbre; it is
involved when a voice is called light, dark, pressed, warm, mellow, and so on; the list of
terms seems unlimited. A number of attributes of voice quality also refer to the
associated singing technique (e.g., covered, open, throaty, pressed, free), as was
investigated by van den Berg and Vennard (1959). When we ask for an explicit, objective
definition of these many terms, only fragmentary data are available. For fruitful
discussions on the singing voice, however, this knowledge seems to be an essential
premise.

Timbre, tone-color, or "Klangfarbe", has
been defined by the American Standards Association (1960) as that attribute of auditory
sensation in terms of which listeners can judge that sounds having the same pitch and
loudness are dissimilar. In an effort to present acoustic variables underlying timbre,
Schouten (1968) mentioned five factors: (1) tonal or noise-like character, (2) the
envelope of the frequency spectrum, (3) the temporal envelope, (4) change in spectral or
temporal envelope, and (5) the onset. Plomp (1970) left dynamic aspects out of
consideration and investigated the timbre of steady-state sounds. For these sounds, timbre
is determined by the frequency spectrum only. He showed that the timbre of sounds can be
represented as points in a multidimensional perceptual space in which distance corresponds
to dissimilarity in timbre: the larger the distance in this space, the more the sounds are
perceptually dissimilar in timbre. Since such a representation of timbre is based on
perceived dissimilarity, it is a psychoacoustical perceptual representation, which has its
origin in properties at a peripheral auditory level of perception. In a large part of the
present study we limited ourselves to stationary sung vowels, so we could use the spatial
representation of timbre.

The perceptual dissimilarities in timbre of
stationary sounds are correlated with the differences in their spectra. This has been
shown by Pols et al. (1969) for spoken vowels and by Plomp (1970, 1976) for sounds of
musical instruments. The spectral representation was based on sound levels (in dB) in
1/3-oct bandpass filters, which approximate the critical bandwidths in audition. The close
correspondence between the (subjective) perceptual space and the (objective) spectral
space makes it possible to study timbre dissimilarities in terms of their underlying
spectral differences.

Within the framework of a spatial representation of
timbre we follow in this paper two main lines in studying timbre in singing:

(1) Is the correspondence between the perceptual and
the spectral spaces for the timbre differences between vowels also valid for small timbre
differences, such as timbre differences in the same vowel sung by different singers?

(2) How can we represent relations between
descriptive terms of timbre in the perceptual and spectral timbre space?

These two main lines have been investigated in two
experiments, which also provided the opportunity to study the following issues:

(a) How consistent is a listener in the use of
timbre terms and what is the agreement among listeners about the meaning of these terms?

(b) What is the influence of musical training on the
use of timbre terms?

(c) How do the many descriptive terms of timbre
cluster and what is their spectral interpretation?

(d) How does timbre perception of steady-state
vowels compare with timbre perception of a song phrase?

I. Material and spectral analysis

In a concert hall, recordings were made of vowels
sung by nine female and eight male singers. All but one (professional bass-baritone singer
1) were advanced students of the Sweelinck Conservatory in Amsterdam, aged between 19 and
26, with 3 to 7 years of vocal training. The microphone distance was 0.3 m, so that the
direct sound predominated. According to their own voice classification, the group
consisted of: 2 bass-baritone, 2 baritone, 4 tenor, 2 alto, 3 mezzo-soprano, and 4 soprano
singers (Table I). The vowels /a/, /i/, and /u/ were sung at a comfortable level at
fundamental frequencies (Fo) of 131 (C3), 220 (A3), and 392 Hz (G4). A tone at these Fo
values was repeatedly presented during the recording sessions to cue the singer. Some male
singers performed at Fo = 392 Hz in modal register and falsetto register as well.

Table I. Classification of the singers

     male singers              female singers
     1  bass-baritone           9  alto
     2  bass-baritone          10  alto
     3  baritone               11  mezzo-soprano
     4  baritone               12  mezzo-soprano
     5  tenor                  13  mezzo-soprano
     6  tenor                  14  soprano
     7  tenor                  15  soprano
     8  tenor                  16  soprano
                               17  soprano

Since we wanted to investigate spectral attributes
of timbre, temporal variations such as vibrato were removed from the vowel sounds. To
accomplish this, each vowel sound was digitized (10 kHz sampling frequency); subsequently
a single period with a fixed number of samples was segmented from the central part of the
vowel, and this period was repeated to obtain a stimulus duration of 400 ms. Care was
taken that the beginning and end of the segmented period did not show a discontinuity. To
avoid clicks at the onset and offset of each stimulus, the sound level increased and
decreased smoothly during the first and last 40 ms, respectively.
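The stimulus preparation described above can be sketched in a few lines of numpy. The waveform used here and the raised-cosine ramp shape are illustrative assumptions; only the sampling rate, the 400-ms duration, and the 40-ms onset/offset times come from the text.

```python
import numpy as np

FS = 10_000        # sampling frequency used in the paper (10 kHz)
DUR_S = 0.400      # stimulus duration, 400 ms
RAMP_S = 0.040     # smooth onset/offset, 40 ms each

def make_stationary_stimulus(period):
    """Repeat a single pitch period to 400 ms and apply onset/offset
    ramps (raised cosine here; the exact ramp shape is an assumption)."""
    n_target = int(FS * DUR_S)
    reps = int(np.ceil(n_target / len(period)))
    stim = np.tile(np.asarray(period, dtype=float), reps)[:n_target]
    n_ramp = int(FS * RAMP_S)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    stim[:n_ramp] *= ramp          # fade in
    stim[-n_ramp:] *= ramp[::-1]   # fade out
    return stim

# Hypothetical single period of a 220-Hz waveform (a plain sawtooth here).
period_len = round(FS / 220)                 # about 45 samples
period = 2 * np.arange(period_len) / period_len - 1
stim = make_stationary_stimulus(period)
```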

Ten different subsets of eight or nine vowel sounds
were made by combining phonations of the same vowel with the same Fo sung by different
singers. The subsets were organized according to vowel type (/a/, /i/, and /u/),
fundamental frequency (131, 220, and 392 Hz), and sex of the singers (see Table II). As
the table indicates, the vowel /a/ was studied for all Fo values, the vowels /i/ and /u/
for the midrange values of 220 Hz (males) and 392 Hz (females) only. Subset V combined a
selection of vowels /a/ by both male and female singers (Fo = 220 Hz). Subset X combined
/a/ phonations sung in the modal register and the falsetto register by male singers (Fo =
392 Hz). Because of the limitations imposed by the listening experiments, the maximum
number of stimuli in each subset was nine. The loudness of the vowels in each subset was
equalized by means of a subjective matching procedure.

In order to determine whether the restriction to
stationary vowels was justified in studying the relations between descriptive timbre
terms, we added subset XI derived from short sung phrases. Recordings were made of a Dutch
folk song sung by the eight male singers. From this folk song the phrase 'Halleluja' was
extracted. Since the singers were free in their interpretation of this song, their
recordings varied in duration between 2.6 and 4.4 s. The loudness of the phrases was equalized
by means of a subjective matching procedure.

The vowel stimuli were analyzed with a
computer-controlled filter bank (Pols, 1977). Eleven 1/3-oct band-pass filters were used
with center frequencies from 400 Hz up to 4 kHz, whereas the filters below 400 Hz were
replaced by three 90-Hz wide filters with center frequencies of 122, 215, and 307 Hz,
respectively, according to the concept of critical bandwidth in audition. The total number
of filters used was 14, 11, and 8, for Fo = 131, 220, and 392 Hz, respectively, because
filter bands which did not contain a partial were excluded.
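The band-level analysis above can be approximated in software. The sketch below sums FFT power between the nominal 1/3-octave band edges (fc·2^(-1/6) to fc·2^(1/6)); it is not the hardware filter bank of Pols (1977) and omits the three 90-Hz-wide low-frequency filters.

```python
import numpy as np

def third_octave_levels(signal, fs, centers):
    """Approximate 1/3-octave band levels (dB) by summing FFT power
    between the band edges fc * 2**(-1/6) and fc * 2**(1/6)."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    levels = []
    for fc in centers:
        lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)
        p = power[(freqs >= lo) & (freqs < hi)].sum()
        levels.append(10 * np.log10(p) if p > 0 else -np.inf)
    return np.array(levels)

# The paper's centers: 400 Hz up to about 4 kHz in 1/3-octave steps.
centers = [400 * 2 ** (k / 3) for k in range(11)]
fs = 10_000
t = np.arange(4000) / fs
tone = np.sin(2 * np.pi * 1000 * t)   # test tone; falls in the 1008-Hz band
levels = third_octave_levels(tone, fs, centers)
```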

II. Experiment 1: Dissimilarity of timbre

A. Procedure

1. Listeners

We used two categories of listeners. The first
category consisted of seven non-musicians (who had never had any musical training), the
second category of nine musicians (five singers and four teachers of singing). These two
categories of listeners were chosen to investigate the influence of musical training on
timbre perception. All listeners had normal audiograms. They were paid for their services.

2. Method

To map the perceptual representation of timbre differences we used the method of paired
comparison of pairs, with the modification that both pairs had one stimulus in common.
Unlike the triadic comparison technique used by Pols et al. (1969) and Plomp (1970), in
which the listener can listen repeatedly to the three stimuli before deciding which pair
is most similar and which is most dissimilar, in the present procedure each pair of pairs
was presented only once, after which the listener had to indicate which pair contained the
more similar stimuli.
The listener, seated in a sound-proof room, heard the stimuli monaurally at a comfortable
level through earphones (Beyer DT-48). All possible pairs of pairs of vowels in a subset
were presented in random order, while each stimulus of a subset was presented equally
often in the first and second pairs. To eliminate order effects, half of the listeners
heard the pairs in reversed order (e.g. AC-AB instead of AB-AC). Stimulus generation,
timing, and response processing were handled by a PDP 11/10 computer.

3. INDSCAL analysis

The results of the paired-comparison experiment were
collected for each listener in a dissimilarity matrix: Every time the listener judged a
particular pair of vowel stimuli as more similar than another pair, the more similar pair
scored one point. The total number of points which could be assigned to a pair could vary
between zero and the total number of vowel stimuli minus one. As we were also interested
in intersubject differences in the representation of timbre, especially between musicians
and non-musicians, the dissimilarity matrices were analyzed by means of a quasi non-metric
version of INDSCAL (Carroll and Chang, 1970). In this multi-dimensional scaling program,
the subjects are assumed to weight differentially the several dimensions of a common
perceptual space, the so-called object space. The individual weighting factors for each
dimension are presented in a subject space. The dimensionality of the INDSCAL solution was
chosen on the basis of the results of matching the configuration in the object space with
the spectral representation of the stimuli, discussed in the next section. The correlation
in each matched dimension had to be significant beyond the 0.01 level.
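The tallying described above can be sketched as follows. The input is a list of the pair judged more similar on each trial; following the text, the dissimilarity of a pair is taken here as the maximum attainable score (number of stimuli minus one) minus its similarity score.

```python
import numpy as np

def dissimilarity_matrix(n_stimuli, winning_pairs):
    """Tally paired-comparison-of-pairs results into a dissimilarity
    matrix.  winning_pairs lists, per trial, the pair judged MORE
    similar; that pair scores one similarity point.  Dissimilarity is
    (n_stimuli - 1) minus the score, the maximum number of points a
    pair could obtain according to the text."""
    sim = np.zeros((n_stimuli, n_stimuli))
    for a, b in winning_pairs:
        sim[a, b] += 1
        sim[b, a] += 1
    dis = (n_stimuli - 1) - sim
    np.fill_diagonal(dis, 0)
    return dis

# Hypothetical outcome of a few trials with four stimuli:
wins = [(0, 1), (0, 1), (2, 3)]
dis = dissimilarity_matrix(4, wins)
```

One such matrix per listener is then passed to the INDSCAL analysis.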

4. Matching of the perceptual and spectral
vowel configurations

For each subset of stimuli we had available both a
perceptual and a spectral configuration of the vowels. The latter configuration is given
by the multidimensional representation of the spectra with the sound level in each filter
band as coordinates. To investigate the agreement between these configurations we matched
them, using the procedure of rotation to maximal congruence (Schönemann and Carroll,
1970) between the spectral and the perceptual configurations. As a result, each perceptual
dimension is optimally fitted with a direction in the spectral space. This direction is
defined as a linear combination of the original spectral dimensions (filter band levels).
As a measure of fit we computed (1) the correlation between vowel coordinates in each
perceptual and matching spectral dimension and (2) the coefficient of alienation (Lingoes
and Schönemann, 1974); this coefficient varies between 0 (perfect fit) and 1 (unrelated
configurations) 1). Since the perceptual space derived from INDSCAL is normalized, we
computed weighting factors for its dimensions which minimized the coefficient of
alienation. Since this procedure involved very little additional rotation of the spectral
space, the correlation coefficients per dimension remained practically unchanged.
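Rotation to maximal congruence is essentially an orthogonal Procrustes problem, which suggests the following numpy sketch. The additional weighting of the normalized INDSCAL dimensions and the coefficient of alienation are omitted; only the rotation and the per-dimension correlations are shown.

```python
import numpy as np

def match_configurations(perc, spec):
    """Rotate the centered spectral configuration to maximal congruence
    with the perceptual one (orthogonal Procrustes via SVD) and report
    the correlation per matched dimension."""
    P = perc - perc.mean(axis=0)          # n stimuli x d perceptual dims
    S = spec - spec.mean(axis=0)          # n stimuli x m filter bands
    U, _, Vt = np.linalg.svd(S.T @ P, full_matrices=False)
    R = U @ Vt                            # m x d, orthonormal columns
    matched = S @ R                       # rotated spectral configuration
    corrs = np.array([np.corrcoef(P[:, k], matched[:, k])[0, 1]
                      for k in range(P.shape[1])])
    return matched, corrs

# Synthetic check: a 3-D configuration embedded in an 11-band spectral
# space by a known orthonormal map should be recovered exactly.
rng = np.random.default_rng(0)
P = rng.standard_normal((8, 3))
Q, _ = np.linalg.qr(rng.standard_normal((11, 3)))
matched, corrs = match_configurations(P, P @ Q.T)
```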

B. Results and discussion

An example of a perceptual (object) space, the
matched spectral space and the listener (subject) space is shown in Fig. 1 for subset II
(/u/, sung by eight male singers at Fo = 220 Hz). This subset was judged by nine singers
and teachers of singing and seven non-musicians. The upper panels of Fig. 1 illustrate the
very good agreement between the vowel configuration in the three-dimensional perceptual
space (filled circles) and the matched configuration in the spectral space (open circles).
As can be seen from the subject space (lower panels), the intersubject differences in the
weighting of the perceptual dimensions are large, both for the musicians (open symbols)
and for the non-musicians (solid symbols). The average listeners' weighting factors for the three
dimensions were .57, .44, and .41, respectively. An analysis of variance of the individual
weighting factors did not show a significant difference between musicians and
non-musicians. Consequently, separate analyses of the data from musicians and
non-musicians resulted in similar perceptual spaces (not shown). This finding supports the
view that in the comparison of timbre similarity, musical knowledge and experience do not
play a part.

A summary of the results of matching the perceptual
vowel configuration with the spectral vowel configuration for all vowel subsets is given
in Table III. The table first gives the total spectral variance for each subset: the sum,
over all filter bands, of the variances in the individual bands, which represents the
spectral differences between the vowel stimuli of the subset. These variances range from
103 to 332 dB², due to differences between singers; this range agrees
well with previous findings of Bloothooft and Plomp (1984) for 14 professional singers.
The spectral differences between the modal register and the falsetto register introduced
the largest spectral variance (subset X).

Table III. Results for the ten subsets of vowel stimuli of matching the configurations in
the perceptual space with the spectral space. For each subset the total spectral variance
is given, as well as the percentage of this variance accounted for by the computed
dimensions, as far as the perceptual and spectral ones correlated significantly at the
0.01 level or between 0.05 and 0.01 (figures within brackets). The subsets were judged by
seven listeners, except subset II which was judged by 15 listeners (see text and Fig. 2).
The coefficient of alienation has been computed over all given dimensions D.

set         vowel  Fo    total spectral   percentage of total spectral     coefficient of
                   (Hz)  variance (dB²)   variance in matched dimensions   alienation
                                           D1    D2    D3    D4   sum
I    8 M    /a/    131        211          34    21    19    17    91           0.39
II   8 M    /u/    220        255          45    21    19     -    85           0.47
III  8 M    /i/    220        239          37    22    16    14    89           0.44
V    9 M+F  /a/    220        200          30    26    26     7    89           0.48
VI   9 F    /a/    220        114          48   (21)    -     -    69           0.68
VII  9 F    /u/    392        248          38    36   (15)    0    89           0.52
VIII 9 F    /i/    392        180          28    24    19    18    89           0.42
IX   9 F    /a/    392        103          20   (34)    -     -    54           0.76
X    9 M    /a/    392        332          38    37   (11)    -    86           0.54

The next five columns give the percentage of total
spectral variance explained by the common dimensions of the perceptual and spectral vowel
configurations. The number of dimensions for which the correlations between vowel
coordinates along the perceptual and matched spectral dimensions were significant beyond
the 0.01 level, varied between one and four. This number is probably related to a minimum
amount of spectral variance needed to define a perceptual dimension. It was found that the
least significant dimension explained on the average 34 dB² of spectral variance. If
spectral variance is uniformly distributed over frequency bands, this value corresponds to
a standard deviation of about 2 dB for each frequency band. If spectral variance is
concentrated in a single frequency band, the variance of 34 dB² would correspond to a
standard deviation of 5.5 dB in that band. De Bruyn (1978) concluded from investigations
on timbre dissimilarity of complex tones that two complex tones are distinguished well by
listeners for a mean difference in sound levels of between 3 and 5 dB in each 1/3-oct
band. The difference limen for individual harmonics in vowel sounds was estimated by
Kakusho et al. (1971) to be less than 2 dB for most vowels. These thresholds roughly
indicate that in our investigation the correspondence between spectral representation and
psycho-acoustic representation of timbre is valid up to the perceptual threshold of timbre
differences. This limit determined the dimensionality of the perceptual vowel
configuration for all Fo values investigated. For low Fo values this is in agreement with
results obtained by Nord and Sventelius (1979) concerning just-noticeable differences in
formant frequency, and by Klatt (1982) for a number of physical manipulations of a single
vowel.
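The variance-to-threshold arithmetic above can be checked directly. The band count of 11 (the Fo = 220 Hz analysis) is an assumption here; the single-band value comes out near 5.8 dB, close to the 5.5 dB quoted in the text, which suggests a slightly different band count or rounding in the original computation.

```python
import math

total_var = 34.0   # dB^2: spectral variance of the least significant dimension
n_bands = 11       # e.g., the number of filter bands at Fo = 220 Hz

# Variance spread uniformly over the bands: per-band SD = sqrt(var / n)
sd_uniform = math.sqrt(total_var / n_bands)   # ~1.8 dB ("about 2 dB")

# Variance concentrated in a single band: SD = sqrt(var)
sd_single = math.sqrt(total_var)              # ~5.8 dB (the paper quotes 5.5)
```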

C. Conclusions

(1) The prediction of timbre differences on the
basis of 1/3-oct spectra is valid up to the perceptual threshold of differences in timbre,
and thus for all kinds of timbre differences in stationary sung vowels.

(2) This prediction is valid up to Fo values of at
least 392 Hz.

(3) In judgments of similarity of timbre,
musical knowledge or experience does not play a role.

III. Experiment 2: Relations between
descriptive terms of timbre

A. Procedure

1. Semantic bipolar scales

For the study of descriptive timbre terms we
designed a listening experiment in which sounds were compared on a number of semantic
bipolar scales. Each semantic scale consisted of two adjectives with opposite meaning,
describing timbre characteristics such as light-dark and colorful-colorless. For the
determination of the set of semantic scales to be used in our listening experiment we
first collected 50 scales from related studies on timbre (Isshiki et al., 1969; Donovan,
1970; von Bismarck, 1974a; Fagel et al., 1983; Boves, 1984) and from the literature on
singing (Vennard, 1967; Hussler and Rodd-Marling, 1976). These semantic scales were rated
by seven experts (speech therapists, teachers of singing) on their suitability for
describing the timbre of sung vowels. Of the 50 scales, 21 were generally judged to be
suitable (Table IV). Of these, scales 1 to 14 and scale 21 were regarded as commonly
known adjectives for the description of timbre and were used by all listeners. The scale
vibrato-straight (21) was only used for judgments of the song phrase. The scales
free-pressed (15), open-throaty (16), and open-covered (17) were considered to be
evaluative of singing technique. The scales dramatic-lyrical (18), soprano-alto (19), and
tenor-bass (20) were intended to investigate relations between timbre and voice
classification. These six scales were judged only by the musicians.

2. Method

The interpretation of a semantic scale was
investigated using the method of paired comparisons. The listener had to judge which of
the two stimuli presented was closer to a given target adjective of a semantic scale, for
example: which of two stimuli was darker (on the semantic scale light-dark). The chosen
stimulus scored one point. After all possible pairs of stimuli had been judged, the total
number of points obtained by each stimulus was used to rank the stimuli of a subset from
light to dark. Semantic scales were handled one after another. For a set of nine stimuli,
720 judgments by a listener were needed to investigate 20 semantic scales. The
adjectives used as targets are marked in Table IV.
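The point-scoring and ranking just described amount to the following sketch; the stimuli, their "darkness" values, and the judgment function are hypothetical.

```python
from itertools import combinations

def rank_by_target(stimuli, prefers_first):
    """Rank stimuli on one semantic scale from complete paired
    comparisons: per pair, the stimulus judged closer to the target
    adjective scores one point; the totals order the stimuli."""
    points = {s: 0 for s in stimuli}
    for a, b in combinations(stimuli, 2):
        points[a if prefers_first(a, b) else b] += 1
    return sorted(stimuli, key=lambda s: points[s])  # e.g. light ... dark

# Hypothetical 'darkness' of four stimuli; the darker one wins a pair.
darkness = [0.2, 0.9, 0.5, 0.1]
order = rank_by_target(range(4), lambda a, b: darkness[a] > darkness[b])

# Nine stimuli give 36 pairs, hence 36 * 20 = 720 judgments for 20 scales.
n_judgments = len(list(combinations(range(9), 2))) * 20
```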

Experiments performed in this way are very time
consuming. Hence we reduced the number of subsets to five subsets of stationary vowels
(II, III, V, IX, X) and one subset of song phrases (XI). In this selection the vowel /a/
was investigated for Fo = 220 and 392 Hz for both male and female singers, while the
vowels /i/ and /u/ were investigated for Fo = 220 Hz for male singers only.
Three half-day sessions were planned for each listener to complete all measurements. Most
of the musicians could not finish their task within the time available; therefore the
number of listeners per subset varied. The stimuli were presented in a random order but
equally frequently in first and last position of a pair. Half of the total number of
listeners heard the stimuli of a pair in reversed order. The experiments were computer
controlled. The same musicians and non-musicians who participated in Experiment 1 served
as listeners. They were paid for their services.

B. Results

1. Reliability

To judge the reliability of the results of a
listener in a paired comparison experiment on a single semantic scale, we computed the
number of circular triads the listener made. A circular triad occurs when, for example,
stimulus B is judged to be darker than A, and C is judged to be darker than B but not
darker than A. In a subset with 8 stimuli, a score of fewer than 9 out of a maximum of 20
possible circular triads was accepted as a consistent and reliable result (0.05
significance level); in a subset with 9 stimuli this criterion corresponds to fewer than
14 out of a maximum of 30 circular triads (e.g., Edwards, 1957). Three different
explanations of circular triads can be given: (1) the stimuli are almost equal, (2) the
semantic scale is not appropriate, or (3) the listener response is not reliable.
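The circular-triad count can be obtained from the win totals via Kendall's identity; the maxima it yields (20 for 8 stimuli, 40 minus... specifically 20 and 30 for 8 and 9 stimuli) agree with the criterion quoted above. A minimal sketch:

```python
def circular_triads(prefs):
    """Count circular triads in a complete paired comparison.
    prefs[i][j] == 1 means stimulus i was judged, e.g., darker than j.
    Kendall's identity: c = C(n,3) - sum_i C(a_i, 2), with a_i the
    number of wins of stimulus i.  (Maximum: 20 for n = 8, 30 for n = 9.)"""
    n = len(prefs)
    wins = [sum(row) for row in prefs]
    return n * (n - 1) * (n - 2) // 6 - sum(a * (a - 1) // 2 for a in wins)

# A fully transitive ranking of four stimuli -> no circular triads.
transitive = [[0, 1, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]]
# One cycle (0 beats 1, 1 beats 2, 2 beats 0) -> exactly one triad.
cyclic = [[0, 1, 0, 1],
          [0, 0, 1, 1],
          [1, 0, 0, 1],
          [0, 0, 0, 0]]
```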

For a subset with many nearly equal stimuli, a high number of circular triads is to be
expected on all semantic scales. In
Table V we give, for the six subsets, the percentage of accepted semantic scale results
for musicians and non-musicians. Musicians are, on the average, more reliable (85 % vs. 71
%). The number of accepted results for each subset shows, especially for the
non-musicians, a clear relationship with the total spectral variance of the subset (last
column). The song phrase, subset XI, is judged most consistently by all listeners. This
demonstrates that the voice characteristics are much more distinct in a song phrase than
in stationary vowels.

Table V. Percentage of judgments with a sufficiently low number of circular triads on all
semantic scales for each subset. The total spectral variance per subset is also given.

set        vowel    Fo (Hz)   musicians   non-musicians   total spectral variance (dB²)
II   8 M   /u/        220        90            81                    255
III  8 M   /i/        220        79            64                    239
V    9 M+F /a/        220        90            94                    200
IX   9 F   /a/        392        64            54                    114
X    9 M   /a/        392        87            83                    332
XI   8 M   phrase      -         94            92                     -
average                          85            71

Since an inappropriate semantic scale will result in
inconsistent results for all listeners, we computed for each semantic scale the percentage
of accepted judgments, according to the criterion given above, over the five vowel subsets
and for the song phrase (Table IV). For the musicians the semantic scales with less than
75 % accepted judgments were full-thin, rough-smooth, strong-weak, open-throaty, and
dramatic-lyrical. This can be explained for the scales rough-smooth and dramatic-lyrical
by the fact that these scales refer to temporal characteristics which are not present in
the vowel subsets; for strong-weak it suggests that this scale was interpreted as
loud-soft, a contrast that had been eliminated by the loudness matching.

Table IV. Bipolar semantic scales used in Experiment 2, ordered according to the
consistency of listeners' judgments. The adjective used as the target of each scale is
marked with an asterisk. The columns give the percentage of judgments with a sufficiently
low number of circular triads on a semantic scale for musicians and non-musicians, both as
the average over the five subsets of stationary vowels and for the subset of song phrases.

                                    musicians          non-musicians
no.  semantic scale              vowels   phrase     vowels   phrase
 2   light-dark*                   89      100         74       82
19   tenor-bass*                   94      100          -        -
 7   high-low*                     89      100         63      100
17   open-covered*                 89       90          -        -
 3   sharp-dull*                   87      100         71      100
13   metallic-velvety*             87      100         69       82
11   cold-warm*                    85      100         66      100
15   free-pressed*                 85       90          -        -
 9   angular-round*                85       90         63       65
 4   clear-dull*                   82      100         82      100
 6   shrill-deep*                  82       90         77       82
12   colorful-colorless*           78      100         51       65
14   melodious-unmelodious*        78      100         51       82
 1   light-heavy*                  76       90         71       82
18   dramatic-lyrical*             70       80          -        -
 5   full-thin*                    69       80         60       64
10   strong-weak*                  65      100         55       82
16   open-throaty*                 59       90          -        -
20   soprano-alto*                 55        -          -        -
 8   rough-smooth*                 52      100         37       65
21   vibrato-straight*              -       90          -       50

2. Relations between semantic scales

The 21 semantic scales may be expected to be highly
interdependent. As a tool to reveal these relations, while preserving intersubject
differences among listeners, we again used the multidimensional scaling technique of
INDSCAL. For a given subset we had for each listener stimulus scores on all semantic
scales; these stimulus scores were rank-correlated for each pair of semantic scales. In
this way we obtained a correlation matrix of semantic scales for each listener. These
correlation matrices were analyzed by INDSCAL.

The semantic scales are represented by positions in
an object space in which distance is related to correlation: (1) the closer the positions
of two scales, the more their correlation coefficient approaches +1, and the more nearly
synonymous the scales are; (2) the more distant the positions of two scales, the more
their correlation coefficient approaches -1, and the more nearly the scales are inverted
versions of each other; (3) in between, with one scale positioned at the origin of the
space or along another dimension, the correlation is zero and the scales are unrelated. This
shows that the point configuration, representing semantic scales, depends on the
arbitrarily chosen polarity of the semantic scales. To eliminate this problem, each
correlation matrix was extended by including all (redundant) correlations between semantic
scales with reversed polarities. After this extension, the versions of a semantic scale
with opposite polarities, of which the correlation is -1 by definition, are positioned
radially and symmetrically relative to the origin in the object space. The complete
solution is therefore radial-symmetric relative to the origin. In the final presentation
of the configuration in the object space, however, only that polarity of a semantic scale
will be given which is the more easily interpretable.
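The two preprocessing steps described above, rank-correlating the stimulus scores for every pair of scales and then doubling the matrix with all reversed-polarity versions, can be sketched as follows (Spearman correlation computed as Pearson on ranks; ties are ignored for simplicity, and the scale scores are hypothetical):

```python
import numpy as np

def ranks(x):
    """Ranks 0..n-1 (ties ignored; paired-comparison scores are
    assumed distinct in this sketch)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def extended_scale_correlations(scores):
    """scores: (n_scales, n_stimuli) array of points per stimulus on
    each semantic scale.  Rank-correlate every pair of scales
    (Spearman = Pearson on ranks), then double the matrix with all
    reversed-polarity versions, so that the INDSCAL object space
    becomes radially symmetric about the origin."""
    R = np.corrcoef(np.array([ranks(s) for s in scores]))
    # Reversing a scale's polarity flips the sign of its correlations.
    return np.block([[R, -R], [-R, R]])

# Two hypothetical scales judged on four stimuli:
scores = np.array([[3.0, 1.0, 2.0, 0.0],    # e.g. points toward 'dark'
                   [0.0, 1.0, 2.0, 3.0]])   # e.g. points toward 'sharp'
M = extended_scale_correlations(scores)
```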

As a final step in summarizing the results, we
combined the correlation matrices of all non-musicians for the five vowel subsets in one
INDSCAL analysis. This was also done for the musicians (who judged an augmented set of
semantic scales). This combination of matrices from different vowel subsets seems
justified, since the subject space revealed that, both for musicians and non-musicians,
the relations between the semantic scales were independent of the subset. This implied
that the relations between semantic scales have a general validity for a listener,
irrespective of vowel and Fo. For the sake of clarity in presentation, we averaged the
position of each listener in the subject space over the vowel subsets. Results for song
phrases were analyzed separately. The results of a two-dimensional INDSCAL analysis for
all four cases are presented in Fig. 2. We limited ourselves to a two-dimensional solution
of INDSCAL, since higher dimensions mostly presented only unique relations between
semantic scales found for individual listeners. This effect can, for instance, be seen in
the second dimension of the subject space for non-musician judgments of stationary vowels,
which is almost exclusive to listener 5. The two-dimensional INDSCAL analysis included
only a limited part of the total variance in the correlations between semantic scales,
varying between 49.1 % (musicians, vowels) and 75.2 % (non-musicians, song phrase). This
relatively low percentage explains the deviations of subject weightings from the unit
circle. Examination of the subject space shows, especially for non-musicians, a tendency
towards a one-dimensional interpretation of the object space, which was either directed
towards dimension I or towards dimension II.

Before interpreting the object space, we should
mention some general properties of that presentation: (1) semantic scales which are close
to each other are highly correlated and are used more or less synonymously; for easier
interpretation, clusters of semantic scales are circled; (2) the distance of a scale from
the origin is a measure of its discriminative power (at least in this plane): the closer a
semantic scale is located to the origin, the more synonymous it is with its reversed
version and the smaller its discriminative power is.

Let us first consider the configuration of semantic
scales in the object space for musicians. The clusters of semantic scales are from left to
right:

(1) Singing technique: free-pressed (15) and
open-throaty (16). For the song phrase, the reversed scales 1, 9, and 11 (heavy-light,
round-angular, and warm-cold) can also be considered to belong to this group. Round and
warm are probably used as more impressionistic alternatives to the description of a good
singing technique. For the stationary vowels, the scales free-pressed and open-throaty
cluster with the scales for a general evaluation.

(2) General evaluation: melodious-unmelodious
(14), colorful-colorless (12), and full-thin (5). For the song phrase these scales are
positively related to scales on singing technique and temporal aspects; for the vowels
they overlap the scales on singing technique.

(3) Temporal aspects: vibrato-straight (21)
and rough-smooth (8). Rough-smooth is unreliable for stationary vowels (this conclusion
was also drawn by von Bismarck, 1974a). Roughness is probably mainly related to
irregularities in periodicity, not present in our vowel stimuli since they consist of
repetitions of a single pitch period. For the song phrase both scales are used
independently from scales on singing technique.

(4) Clarity: clear-dull (4), high-low (7),
strong-weak (10), and open-covered (17). It is noteworthy that the technical scale
open-covered (17) is to a great extent unrelated to other scales on singing technique,
namely free-pressed (15) and open-throaty (16).

(5) Sharpness: sharp-dull (3), light-dark
(2), shrill-deep (6), and metallic-velvety (13). For the song phrase the scale tenor-bass
(19) is also included in this group. The scale soprano-alto was not used in the present
analysis, since it was not applied to all vowel subsets. Separate analysis of subsets VI
and VIII showed that the scale soprano-alto led to the same judgments as sharp-dull. For
stationary vowels the scales light-heavy (1), angular-round (9), and cold-warm (11) can be
included in sharpness, too. The scale dramatic-lyrical (18) is judged ambiguously. For the
song phrase the scale is related to temporal aspects, clarity and sharpness, for the
stationary vowels the scale presents a general evaluation.

According to the characteristics of the INDSCAL
analysis, the two dimensions of the object space may have a fundamental psychological
meaning. In this view, we suggest that dimension I can be interpreted as a 'pleasantness'
factor (melodious-unmelodious, free-pressed, open-throaty, cold-warm, angular-round),
while dimension II can be interpreted as a 'potency' factor (clear-dull, vibrato-straight,
strong-weak). When both dimensions are equally weighted, as was the case for most
musicians, the clusters of semantic scales mentioned above can be described by some
combination of these two factors. For some listeners, however, there was only one
dominating dimension. When dimension I dominates, the scales which evaluate temporal
aspects do not discriminate, while all other scales are used in the same way, with
sharpness and clarity negatively related to general evaluation and singing technique. When
dimension II dominates, the scales on singing technique do not discriminate, while all
other scales are used in the same way, with sharpness, clarity, and temporal aspects
positively related to general evaluation.

For non-musicians, the results for the song phrase
and the vowels are very similar and show a further simplification relative to the results
of the musicians. The interpretation of the configuration is easy since non-musicians on
the whole use, as the subject space indicates, either dimension I or dimension II. This
implies that, apart from some unreliable scales, most semantic scales are used
synonymously, according to sharpness. In view of the position of scales 12 and 14, the
subjects only differed in their opinion as to whether sharp had to be associated with
unmelodious (when dimension I was used) or melodious (when dimension II was used).

In summary, semantic scales are used very
one-dimensionally by most non-musicians and also by some musicians. Most semantic scales
are used according to sharpness and the only difference between listeners is whether sharp
is positively or negatively associated with a general evaluation of the sound and with
singing technique. Most musicians, however, differentiate more clearly between semantic
scales, especially for the song phrase, for which clusters of semantic scales related to
singing technique, general evaluation, temporal aspects, clarity, and sharpness can be
distinguished. However, these groups of scales are not independent, but can be represented
in two dimensions with the psychological interpretations of 'pleasantness' and 'potency'.
For vowels, the results depended neither on vowel type nor on fundamental frequency.

3. Relation between semantic scales and
perceptual space

In order to learn how the semantic scales are
related to the perception of the stimuli of a subset, we used the method of
multidimensional preference analysis, MDPREF (Carroll, 1972). From the listening
experiments we obtained for each listener, for each semantic scale, stimulus scores which
ranked the stimuli from one adjective of the semantic scale (low score) to the other
adjective (high score), for instance from the most light to the most dark stimulus. For a
given subset, the stimulus scores on all semantic scales, for all listeners, served as an
input for the MDPREF analysis. In MDPREF the ordering of stimuli along a semantic scale is
represented as an ordering of stimuli along a straight line in a perceptual space. The
MDPREF algorithm computes in a multidimensional perceptual space both the stimuli, which
are represented as points, and the semantic scales, which are represented as straight
lines through the origin of that space.

Mathematically a straight line is described by a
vector. The direction of a vector is computed in such a way that the projection of the
stimulus points on the vector corresponds as closely as possible (least-squares criterion)
to the stimulus scores on the semantic scale concerned. The first dimension of the
perceptual space explains most variance in the listeners' judgments, the second most of
the remaining variance, etc. The stimulus configuration is normalized so that the variance
is equal in all dimensions. The semantic-scale vectors are given unit-length and in
graphical presentations of results they are represented by their end points. An example of
such a representation is given in Fig. 3.
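The least-squares fitting of a semantic-scale vector described above can be sketched as follows (a simplified illustration in Python/NumPy, not the actual MDPREF implementation; the function name and toy data are ours):

```python
import numpy as np

def fit_scale_vector(stimuli, scores):
    """Fit a unit-length direction such that the projections of the
    stimulus points onto it approximate the semantic-scale scores
    as closely as possible (least-squares criterion)."""
    X = np.asarray(stimuli, dtype=float)       # (n_stimuli, n_dims)
    y = np.asarray(scores, dtype=float)
    y = y - y.mean()                           # center the scores
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares weights
    return w / np.linalg.norm(w)               # unit-length scale vector

# toy example: four stimuli in a two-dimensional perceptual space
pts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
scores = np.array([2.0, 1.0, -2.0, -1.0])      # e.g. from light to dark
v = fit_scale_vector(pts, scores)              # unit vector in the plane
```

The end point of such a unit vector is what is plotted in representations like Fig. 3.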

Our application of MDPREF is unconventional because
normally only one aspect (preference) of a stimulus set is investigated. We employed
MDPREF on data from a large number of different semantic scales. This is allowed if we may
assume that the stimulus configuration in the perceptual space is invariant for the
different semantic scales. In Sec.IIIB4 we will show that the stimulus configuration
derived by MDPREF and the one derived in Experiment 1 with the similarity-comparison
paradigm, are highly congruent. This supports the view that the perceptual representation
of timbre is an invariant stimulus representation, which is the basis for the further
labeling of directions in that space by semantic-scale vectors.

Because we showed in Experiment 1 that the stimulus
configuration in the perceptual space did not depend on musical experience, we combined in
the present experiment for each subset the judgments on all semantic scales for both
musicians and non-musicians. The interpretation of semantic scales, however, may vary from
scale to scale and from listener to listener, and will be represented by the
semantic-scale vector configuration. Whenever a listener's judgments on a semantic scale
included too many circular triads, these data were not used in the MDPREF analysis.

The result of the MDPREF analyses did not allow a
generalized interpretation of the semantic scales, due to great intersubject differences.
This will be demonstrated with the examples in Figs. 4 and 5. In Fig. 4a-d typical
judgments on the semantic scales are presented for the subset of song phrases. All panels
present the same stimulus configuration but show different subsets of semantic-scale
vectors. The first two dimensions of the perceptual space explain 71 % of the total
variance in semantic-scale judgments. In Fig. 4a and Fig. 4b all accepted semantic
scales for musician 7 and non-musician 2 are presented. The spread of the positions
of the semantic-scale vectors shows that the musician is able to distinguish between
several characteristics of the song phrases. The clustering of the semantic-scale vectors
for the non-musician demonstrates that most semantic scales, except vibrato-straight (21),
are used synonymously, implying that this subject is unable to describe more than one
perceptual dimension. For both listeners the relations between the semantic scales are in
good agreement with the general results, presented in Fig. 2.

In Fig. 4c and Fig. 4d, for all musicians (open
circles) and non-musicians (filled circles) the directions of the semantic scales
clear-dull and colorful-colorless are shown, respectively. The scale clear-dull shows
corresponding judgments along the first dimension for all listeners. This is not the case
for the scale colorful-colorless, which most musicians judged to be close to the second
dimension of the perceptual space, while non-musicians judged again along the first
dimension. It can be seen that along dimension I some listeners even have opposite
opinions on this scale. The musician and the non-musician for whom the semantic-scale
vector is positioned near the origin use this scale in another perceptual dimension.

Fig. 5 shows the two-dimensional MDPREF solution for
subset III (/i/ sung by 8 male singers at Fo = 220 Hz). The positions of all accepted
semantic scales are given for musicians 2 (open squares) and 4 (open circles) and
non-musicians 2 (filled squares) and 4 (filled circles). In this example of
stationary vowels, the clustering of most semantic scales per listener illustrates that
the scales are used synonymously but in different ways by the individual listeners. This
effect was most clearly present in the subsets with stationary vowels.

The application of semantic scales can only be
demonstrated in examples such as those given in Figs. 4 and 5. Due to strongly listener-
and vowel-dependent behavior, the results are difficult to generalize. Especially when a
listener used all semantic scales synonymously, it is likely that only one particular
perceptual attribute was dominant in the vowel subset, and that all semantic scales were
judged according to this attribute. The attribute concerned differed among listeners.

4. Comparison with the vowel configurations
derived from Experiment 1

The MDPREF analyses resulted in a perceptual vowel
configuration for each vowel subset. Since for these subsets vowel configurations were
also available from INDSCAL analysis of timbre dissimilarities in Experiment 1, we
investigated whether these two configurations, derived with quite different techniques,
were comparable. For this purpose, we rotated both normalized vowel configurations to
maximal congruence. For all subsets the correlation between coordinate values of vowels on
matched dimensions was significant beyond the 0.01 level and the coefficient of alienation
varied between 0.36 and 0.54. Fig. 6 shows an example of the matching of the vowel
configurations for subset IV, for which the matching with the spectral vowel configuration
was already shown in Fig. 1.
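Rotation to maximal congruence can be illustrated with the standard orthogonal Procrustes solution (a sketch assuming a plain orthogonal rotation; the matching procedure used here may differ in normalization details):

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal matrix R minimizing ||A @ R - B|| (Frobenius norm),
    i.e. rotating configuration A to maximal congruence with B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# toy example: B is A rotated by 90 degrees, so the rotation is recovered
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
theta = np.pi / 2
R_true = np.array([[np.cos(theta), np.sin(theta)],
                   [-np.sin(theta), np.cos(theta)]])
B = A @ R_true
R = procrustes_rotation(A, B)                  # recovers R_true
```

After rotation, the correlation between coordinate values on matched dimensions can be computed dimension by dimension, as reported above.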

Although the fit between the two configurations was
generally good for all the subsets, there was no one-to-one correspondence between the
original dimensions of the two spaces. The dimension related to sharpness, for instance,
did not come out immediately in the INDSCAL analysis of the similarity-comparison data.
This indicates that the dimensions originally derived by means of INDSCAL did not have a
unique psychological meaning. Apart from the difference in the orientation of dimensions,
we may conclude that both the comparison of timbre similarity of vowels (Experiment I) and
the ranking of vowels along semantic scales (Experiment II) resulted in the same
configuration of vowels in the perceptual space.

5. Spectral correlates of perceptual
dimensions of vowel subsets

The vowel-point configuration derived by means of
MDPREF can also be related to the 1/3-oct spectra representation of the vowels. This was
done in the same way as in Experiment 1, using orthogonal rotation to congruence (see
Sec.IIA4). The results of the matching procedure are given in Table VI. The dimensionality
of each subset and the total amount of spectral variance explained compare well with the
results of Experiment 1 (Table III). This was to be expected, since the vowel configurations
derived from both types of listening experiments were highly congruent.

Table VI. Results of matching
vowel configurations in the perceptual space, derived by MDPREF from semantic scale
judgments, and in the spectral space. For both the variance in semantic scale judgments
and spectral variance, the distribution over matched dimensions is given. The correlation
between perceptual and spectrum coordinate values along the presented dimensions is for
all dimensions beyond the 0.01 level of significance except for the figures within
brackets, which were significant beyond the 0.05 level. Ls represents the number of
listeners for each subset. The coefficient of alienation has been computed over all given
dimensions D.

set        vowel    Fo (Hz)  Ls   percentage variance in          total spectral   percentage spectral             coefficient of
                                  semantic-scale judgments        variance (dB²)   variance                        alienation
                                  D1   D2   D3   D4   sum                          D1   D2   D3   D4   sum
---------------------------------------------------------------------------------------------------------------------------------
II  8 M    /u/      220      16   49   17   12   10    88         255              28   23   22   18    91         0.51
III 8 M    /i/      220      14   54   20    8    4    86         239              16   27   13   36    92         0.46
V   9 M+F  /a/      220      15   57   14    9    -    80         200              32   31   22    -    85         0.53
IX  9 F    /a/      392      21   33  (23)   -    -    56         114              18  (14)   -    -    32         0.88
X   9 M    /a/      392      16   58   15   11    -    84         332              39   28   21    -    88         0.48
XI  8 M    phrase    -       16   57   14   10    5    86          -                -    -    -    -     -          -

The good fit between vowel configurations in the
perceptual space and the spectral space allows us to associate properties between the
corresponding directions in both spaces. This means that we can assign a spectral vector
to a perceptual semantic-scale vector. This spectral vector is a linear combination of the
original spectral dimensions (related to the sound level in frequency bands). The
contribution of each original dimension (frequency band) to the spectral vector is
expressed as the direction cosine of the angle between the spectral vector and the
original dimension. The value of this direction cosine can vary between 1 (identical), 0
(unrelated), and -1 (identical, but in opposite direction). The presentation of the values
of the direction cosines as a function of the center frequency of the frequency bands is
called the profile of the spectral vector (see also Bloothooft and Plomp, 1985). When a
spectral vector is matched with a perceptual semantic-scale vector, the profile can be
considered to represent the spectral variation which underlies perceptual judgments on a
semantic scale.
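Since the direction cosine with each original band axis is simply the corresponding component of the unit-normalized vector, the profile computation reduces to a normalization (a minimal sketch; band frequencies and weights are invented for illustration):

```python
import numpy as np

def profile(spectral_vector):
    """Direction cosines of a spectral vector with the original
    frequency-band axes: the components of the unit-length vector.
    Each value lies between -1 and 1."""
    v = np.asarray(spectral_vector, dtype=float)
    return v / np.linalg.norm(v)

# hypothetical spectral vector over four 1/3-oct bands (dB weights)
center_freqs_khz = [0.5, 1.0, 2.0, 4.0]        # illustrative only
vec = np.array([-3.0, 0.0, 0.0, 4.0])
p = profile(vec)                               # -> [-0.6, 0.0, 0.0, 0.8]
```

Plotting p against the center frequencies of the bands gives the profile of the spectral vector.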

Spectral vectors can be derived for all individual
semantic-scale judgments on stationary vowels. It would be of interest to search for
spectral descriptions with a general validity for semantic scales. Unfortunately, the
large intersubject differences in the interpretations of semantic scales, demonstrated
previously, do not allow this. However, we can give the corresponding spectral
interpretations of the principal dimensions of the perceptual space, derived by MDPREF.
These dimensions are determined on the basis of the explained variance in the
semantic-scale judgments; the first dimension explains most of this variance, the second
dimension most of the remaining variance etc. In Fig. 7 the main results are presented.
This figure shows for the five subsets of stationary vowels: (1) the average spectrum, (2)
the profiles of the spectral vectors associated with the first two matched perceptual
dimensions; because these vectors define an orthogonal basis of the matched spectrum space
they are called basis vectors, and (3) the vowel configuration in the plane of the first
two dimensions of the perceptual space. Table VI indicates that for most subsets more than
half of the total variance in perceptual judgments is covered by the first dimension of
the perceptual space. Figure 7 (panels of the second column) demonstrates that this
dimension typically has spectral-slope weighting properties. Spectral slope is independent
of the vowel type of a subset and has a general interpretation. This corresponds well with
results of a factorial analysis of verbal attributes of timbre by von Bismarck (1974a),
who found, for complex stationary harmonic tones, only one prominent attribute: sharpness.
Von Bismarck (1974b) and Benedini (1980) demonstrated that sharpness was related to the
relative importance of higher harmonics. It is remarkable, however, that spectral slope
also plays the most important role when the corresponding amount of spectral variance is
relatively low, as is the case for subset III (see Table VI). For some other subsets,
variation in spectral slope coincides with typical properties of the vowel subset: in
subset II the singers 3 and 8 colored the vowel /u/ towards /o/; the spectral effect of
this phonemic difference (all formant frequencies of /o/ are higher than those of /u/) is
represented along dimension I; for subset X the configuration in the perceptual space
(Fig. 7, last column) shows that dimension I contributes highly to the differentiation
between falsetto and modal registers (except singer 6, a very "dull" tenor). In
subset V, dimension I of the perceptual space shows that the differentiation between
soprano and alto singers has spectral-slope like properties (see also Bloothooft and
Plomp, 1986a). In Sec.IIIB2 it was said that the semantic scale soprano-alto is used in
the same way as sharp-dull. This correspondence between soprano and sharp implies that
strong higher harmonics in the vowel sounds used in this experiment are associated with
soprano singing, which is completely contrary to the actual spectral differences between
soprano and alto singers.

The profiles of the second basis vector (third
column of Fig. 7) show that the related perceptual dimensions have no general acoustical
interpretation. The properties of these dimensions are probably related to the effects of
vowel articulation. For subsets II and III, the vowels /u/ and /i/ respectively, the
second dimension weights the depth of the spectral valley between lower and higher
formants. This dimension is, for the vowel /i/ (subset III), strongly related to phonemic
differences: most vowels /i/ were colored towards /y/, except for the singers 2, 5, and 6.
In subset V, with combined male and female phonations of /a/, the second dimension weights
the frequency positions of the peaks of higher formants. This property roughly
discriminates between tenor and bass singers, as can be seen in the configuration in the
perceptual space (Fig. 7, last column; see also Bloothooft and Plomp, 1985). For subsets
IX and X, the vowel /a/, the profile of the second basis vector is comparable and weights
the level of the frequency band of 1.2 kHz. Such a profile differentiates between the
phonemes / / and /a/; not all singers produced the requested phonemic quality precisely.
It may be noted that for subsets II, III, and IX the perceptual vowel configuration (Fig.
7, last column) does not have a relationship to the voice classifications of the singers.

The distribution of the variance in semantic-scale
judgments over the perceptual dimensions, and therefore the order of importance of these
dimensions, depends on our choice of selected scales. We cannot exclude that a single
semantic scale describes a specific perceptual dimension, explaining little variance,
while a large number of scales may be focused on one other perceptual dimension,
explaining a large amount of variance. In previous sections it has been shown that for the
subsets of stationary vowels there is agreement among listeners about scales which
describe sharpness, the first dimension in the present analysis. A detailed study of the
data did not reveal another perceptual dimension for which listeners agreed in their
description. Therefore, the second and higher dimensions mainly rely on the extent to
which listeners, unsystematically, use the acoustical properties of these dimensions in
their judgments.

In summary, when listeners are requested to judge
stationary vowels on semantic scales, they seem to focus primarily on differences in
spectral slope between the vowels, even when this difference is smaller than those for the
other perceptual dimensions. A large number of different semantic scales, related to
sharpness, are judged according to this criterion. For other perceptual dimensions there
is no agreement among listeners on verbal attributes.

6. Acoustic correlates of perceptual
dimensions of song phrases

The acoustic properties of the song phrase
'Halleluja' are much more complex than the spectral variation between stationary vowels.
Temporal aspects, such as vibrato and vowel duration, may also have influenced the
judgments of listeners. This makes it difficult to relate perceptual dimensions to
possible acoustic correlates. Since the first / / and final /a/ of 'Halleluja' took up
more than half of the total phrase duration, we used these two vowels to investigate
spectral correlates. The 1/3-oct spectrum of the vowel segments was measured every 10 ms,
the resulting 10-ms spectra were normalized for overall sound-pressure level and the
average of these spectra was considered to be the representative spectrum of each singer.
Subsequently, the configuration of the corresponding eight points in the spectrum space
was matched with the perceptual configuration. Three dimensions showed significant
correlations (p<0.01). The profiles of the spectral basis vectors associated with the
first three perceptual dimensions are shown in Fig. 8, together with the grand-average
spectrum. The first dimension accounts for 45 % of the total spectral variance. The
profile of the first basis vector shows that the corresponding perceptual dimension,
describing sharpness, is, for song phrases too, associated with spectral-slope-like
variation. The profile of the second basis vector strongly weights the sound level of the
frequency bands with center frequencies of 0.8 and 2.5 kHz. This indicates that a positive
contribution of this dimension, perceptually associated with full, melodious, and
colorful, is related to a more open vowel, /a/ instead of /a/, and a higher sound level of
the high spectral peak. The latter peak is also known as the singer's formant and
described as the origin of the "ring" of the voice (Bartholomew, 1934;
Bloothooft and Plomp, 1986b). The frequency position of this spectral peak is weighted by
the third basis vector. No systematically used verbal attribute was associated with this
direction.
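The spectrum-averaging step described above can be sketched as follows (assuming spectra stored as dB levels per band; removing each frame's mean band level is one plausible reading of "normalized for overall sound-pressure level"):

```python
import numpy as np

def representative_spectrum(frames_db):
    """Average of successive 10-ms 1/3-oct spectra after removing
    each frame's overall level (mean dB across bands)."""
    frames = np.asarray(frames_db, dtype=float)      # (n_frames, n_bands)
    normalized = frames - frames.mean(axis=1, keepdims=True)
    return normalized.mean(axis=0)

# two invented 10-ms frames over three bands (levels in dB)
frames = [[60.0, 50.0, 40.0],
          [66.0, 56.0, 46.0]]
spec = representative_spectrum(frames)               # -> [10., 0., -10.]
```

The resulting per-singer spectra can then be matched against the perceptual configuration as described.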

The amount of spectral variance accounted for by the
second and third dimensions was only 9 and 12 %, respectively. Therefore, it may well be
possible that other acoustical factors than those present in the average spectrum of the
two vowels contribute to these dimensions. Concerning the influence of temporal measures
on perceptual judgments, no effect of total phrase duration (tempo) could be established.
However, this could be expected since the listeners were requested to ignore this factor.
The specifically temporal semantic scale vibrato-straight was judged to a great extent on
the basis of the depth of vibrato modulations (r = 0.74) and not on vibrato rate (r =
-0.13). Fig. 2 showed that this scale was both positively related to sharpness and to
general evaluation (full, melodious, colorful). Therefore, the presence of a good vibrato
may contribute as a temporal attribute to these factors.

C. Discussion

The experimental results showed that the human
peripheral auditory system can detect very small differences in timbre. Acoustically,
these differences amount to only a few decibels per 1/3-octave frequency band. Detection
of timbre differences can be modelled as the observation of distance in a multidimensional
perceptual space. With respect to the ear's frequency-analyzing power, the maximum number
of dimensions of this perceptual space may be high, theoretically. However, in most cases
this number will depend on the Euclidean dimensionality of the spectral variation in
stimuli, with the restriction of a minimum amount of spectral variation of about 34 dB²
per dimension. For harmonic vowel sounds in speech and singing, which are limited with
respect to produced spectral variation, this criterion implies for a single vowel, sung by
different singers, a maximum of about four independent perceptual dimensions. For the
entire vowel system the number of dimensions is only slightly higher, with a maximum of
about five dimensions, because a great deal of spectral variation between different vowels
is already captured by the type of spectral variation within a single vowel. The number
five corresponds well with the number of resonance frequencies describing the properties
of the vocal tract.
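The dimensionality criterion stated above can be illustrated by counting principal components of the spectral variation that each carry at least the quoted ~34 dB² (a sketch; the threshold is taken from the text, and the demo data are invented):

```python
import numpy as np

def perceptual_dimensions(spectra_db, min_variance=34.0):
    """Count principal components of the band-level variation that
    each explain at least min_variance (in dB^2), as an estimate of
    the number of usable perceptual dimensions."""
    X = np.asarray(spectra_db, dtype=float)    # (n_stimuli, n_bands)
    X = X - X.mean(axis=0)                     # center per band
    cov = X.T @ X / (X.shape[0] - 1)           # band covariance (dB^2)
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # variances, descending
    return int(np.sum(eigvals >= min_variance))

# invented example: all variation confined to one spectral direction
demo = [[0.0, 0.0], [10.0, 0.0], [20.0, 0.0], [30.0, 0.0]]
n = perceptual_dimensions(demo)                # -> 1
```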

The representation of stimuli in a perceptual space
was based on a grand average for a large number of listeners, whose individual perception
may deviate from this idealized grand-average representation. The use of INDSCAL analysis
and MDPREF analysis revealed those intersubject differences. It is interesting, however,
that the two different experimental techniques we used, similarity comparison and semantic
scaling, resulted in comparable stimulus configurations. Whereas in the
similarity-comparison experiment detection properties of the auditory system play a large
role, this is not self-evident in the semantic-scaling experiment. The latter approach
makes use of adjectives which require the mediation of some internal reference. That both
types of experiments still resulted in comparable grand-average stimulus configurations
supports the view that these representations are really basic for human timbre perception.
For each individual, however, a central weighting of perceptual dimensions may influence
experimental outcomes, both in the similarity-comparison and in the semantic-scaling
experiments.

We showed that a difference in timbre correlates
with a difference in 1/3-oct spectrum for stationary vowels, at least up to a fundamental
frequency of 392 Hz. This allows us to investigate properties of vowel representations in
the perceptual space on the basis of 1/3-oct spectra only, that is, without the need to
perform time-consuming perceptual experiments. For vowels sung by professional singers,
results of such a study have been reported in Bloothooft and Plomp (1984, 1985, 1986a,
1986b).

For all subsets of stationary vowels, only sharpness
turned out to be a verbal attribute of timbre on which most listeners, regardless of their
degree of musical training, agreed in their judgments; they only differed in their
evaluation of sharpness: whether sharpness was melodious or not. In conformity with von
Bismarck (1974b), sharpness was found to be acoustically related to the slope of the
spectrum. The importance of sharpness in timbre perception was even apparent for subset
III, in which only 16 % of total spectral variance was associated with this factor. These
results support von Bismarck's opinion that sharpness may be considered as a fundamental
perceptual quality, besides pitch and loudness, of any harmonic complex tone.

For spoken and sung vowels both a psycho-acoustical
level and a phonetic level of perception may be distinguished. At the more central,
phonetic, level the phonemic identity of a vowel is determined. This level is especially
sensitive to formant-frequency variation (Klatt, 1982). It may well be possible that a
number of verbal attributes of the timbre of vowel sounds refer to this level of
perception and describe, for instance, formant-frequency deviations from typical reference
values. The acoustical interpretation of such verbal attributes would then be vowel
dependent. Since our subsets each included only one vowel, this kind of timbre description
could have emerged from the listening experiments. Although the second perceptual
dimension of the vowel subsets, shown in Fig. 7, did turn out to be related to
vowel-specific acoustical variation, no indications were found that listeners agreed in
their verbal description of this variation. This suggests that there are no stable verbal
attributes for the phonetic level of perception under the experimental conditions used
here.

The present experiments failed to reveal a
relationship between the description of timbre of stationary vowels and voice
classification (see Fig. 7). In all cases both the semantic scales tenor-bass and
soprano-alto were used in the same way as sharp-dull: the more high-frequency energy, the
higher the estimated voice classification. In fact, sharpness was unrelated to actual
voice classification at all and even showed a reverse relationship with female voice
classification for subset IX (Fig. 7). Whereas such results may be attributed at first
sight to the restrictions of stationary vowels, which make it impossible to estimate voice
classification even for musically trained listeners, the observation persisted to some
extent for the phrases sung by male singers. Although in this case the relationship
between judgments on the scale tenor-bass (19) and actual voice classifications was rather
good (as an example, see the stimulus configuration in Fig. 4), a contradictory result was
obtained for tenor singer 6, who had a rather dull voice and was associated with the
lowest voice classification. For the song phrases the semantic scale tenor-bass was also
highly correlated with the first perceptual dimension and, therefore, associated with the
slope of the spectrum (Fig. 8). Listeners seem to relate a shallow spectral slope to
tenor-voice timbre type and a steep negative spectral slope to bass-voice timbre type. A
shallow slope may originate in the spectral effect of high first and second formants (e.g.
Fant, 1960) or in a shallow source spectrum. Cleveland (1977) indicated that higher
formant frequencies are indeed associated with higher voice classifications in
professional male singers. The contradictory result for "dull" tenor singer 6 in
the present experiment possibly demonstrated the confusing influence of a steep source
spectrum. This raises the interesting question of whether perceptual voice classification,
based on timbre, has a phonetic basis (formant frequency detection) or a psycho-acoustical
basis (sharpness detection). The present results suggest a psycho-acoustical basis which
may lead, however, to incorrect judgment of voice classification. Fortunately, many more
factors establish voice classification, which obviates a wrong judgment on the basis of
timbre alone.

The limitation in most of our experiments to
steady-state vowels made it possible to conduct well-defined perceptual experiments, but
this experimental paradigm may be rather far-off from perception in real singing
performance. It was stressed in a review by Risset and Wessel (1982) that dynamic factors
in timbre contribute to the identification and naturalness of musical instruments. We may
assume that this will also be the case for the singing voice. Nevertheless, it was our
informal observation that many typical characteristics of sung vowels are also present in
their steady-state versions, irrespective of their somewhat mechanical character.
Furthermore, our listening experiment with song phrases was a first attempt to bridge the
gap between experiments with stationary sounds and real singing performance. Just as for
stationary vowels, sharpness was the most important spectral attribute of the timbre of
song phrases. The second perceptual dimension (colorfulness) indicated the influence of
the relative sound level of the singer's formant. Vibrato was not found to take up a
separate perceptual dimension but vibrato quality may enhance judgments on both sharpness
and colorfulness, especially when the spectral contributions of these perceptual
dimensions are small.

Finally, we should be careful in interpreting
semantic scales in terms of acoustical properties in view of the small number of singers
in the subsets. Accidental combinations of acoustic characteristics of singers, or their
absence, may have influenced the results. Nevertheless, we trust that the main effects,
found for most subsets, are likely to have a more general validity.

Acknowledgement

This research was supported by the Netherlands
Organization for the Advancement of Pure Research (ZWO). The authors are indebted to the
singers for their kind cooperation and to Louis C.W. Pols and two anonymous reviewers for
their comments on earlier versions of this paper.

Note 1.

More precisely, we used the matrix analogue of a
coefficient of alienation, S½. For the one-dimensional case, S is the analogue of 1 - r²,
where r is the correlation coefficient. For example, for the one-dimensional case, for r =
0.81 (a significant correlation for 9 pairs of data points at the 0.01 level), S½ = 0.60.


Legends of figures 1-8

Fig. 1. Result of an INDSCAL analysis on data from
similarity-comparison judgments of the vowel /u/, sung by eight male singers at Fo = 220
Hz. The upper panels show the I-II and the I-III planes of the object space and spectral
space combined. Filled circles form the vowel configuration obtained from INDSCAL; open
circles represent the best-fitting spectral vowel configuration. The lower panels show the
corresponding planes from the subject space of INDSCAL. Coordinate values of points
represent the weight a subject attaches to a dimension. Open squares are musicians, filled
squares are non-musicians.

Fig. 2. Representation of relations between semantic
scales by means of the results of INDSCAL analyses. The I-II plane of both the object
space (semantic scales) and the subject space (listeners) is presented for the subset of
song phrases and the subsets of stationary vowels, both for musicians and non-musicians.
Semantic scale numbers refer to Table IV.

Fig. 3. Example of presentation of results of MDPREF
analysis. Stimuli are represented as points in a space and semantic scales are represented
as vectors on which the projection of the stimulus points gives the best estimate of a
listener's judgment.
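The vector model underlying this representation can be sketched numerically: a listener's judgment on a semantic scale is estimated by projecting each stimulus point onto the unit vector for that scale. The coordinates below are hypothetical and purely illustrative; this sketches only the projection step, not the MDPREF fitting procedure itself.

```python
import numpy as np

# Hypothetical 2-D coordinates of three stimuli in the perceptual space.
stimuli = np.array([[0.8, 0.2],
                    [-0.3, 0.9],
                    [0.1, -0.7]])

# Hypothetical direction of one semantic scale, normalized to unit length.
scale_vector = np.array([1.0, 1.0])
unit = scale_vector / np.linalg.norm(scale_vector)

# Projections of the stimulus points onto the scale vector give the
# estimated judgments on that scale.
estimates = stimuli @ unit
print(estimates)
```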

Fig. 4. Perceptual space for the song phrases sung
by eight male singers (large filled circles, the numbers of which refer to Table I) with
various results of semantic-scale vectors:

(a) All accepted semantic scales (numbers refer
to Table IV) for musician 7.
(b) All accepted semantic scales for non-musician 2.
(c) All accepted listeners on the semantic scale clear-dull; small open
circles refer to musicians, small solid circles to non-musicians.
(d) All accepted listeners on the semantic scale colorful-colorless; small
open circles refer to musicians; small solid circles to non-musicians.

Fig. 6. Results of matching vowel configurations in
a perceptual space derived from the similarity-comparison experiment (filled circles) and
in a perceptual space derived from semantic-scaling experiments (open circles), in the
first two best-fitting dimensions. The subset consisted of the vowel /u/, sung by eight
male singers at Fo = 220 Hz (see also Fig. 1).

Fig. 7. Results of matching vowel configurations in
the perceptual space (derived from semantic-scaling experiments) and the spectrum space.
Left-hand panels show the grand-average spectrum of each subset; the middle panels present
the profiles of the spectral basis vectors associated with the first two perceptual
dimensions, and the right-hand panels show the vowel configurations in the perceptual
space. Numbers refer to singers (see Table I), f indicates falsetto register, t indicates
a tenor-like phonation produced by baritone singer 3.

Fig. 8. Results of matching the perceptual
configuration of song phrases and the configuration of average spectra of the two vowels /
/ and /a/ in 'Halleluja'. The grand-average spectrum and the profiles of the first three
spectral basis vectors are shown.