Re: speech/music -> speech/singing (Dan Ellis)

Bruno offers three low-level features to distinguish speech from music:
speech has smoothly-varying pitch, smoothly-varying formant structure, and
is not very strictly rhythmic; in contrast, music tends to have
piecewise-constant 'pitch', changes in spectra which are abrupt when they
occur, and much stricter rhythm.
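As a toy sketch of the pitch cue (my own illustration, not anything from Bruno's post), one could quantify pitch stability as the typical frame-to-frame change in a log-f0 track; the synthetic tracks, function name, and jitter figure below are all assumptions for illustration:

```python
import numpy as np

def pitch_stability(f0_hz):
    """Median absolute frame-to-frame change in log-f0, in semitones.
    Small values suggest the piecewise-constant pitch of music/singing;
    larger values suggest the continuous glides of speech."""
    semitones = 12 * np.log2(np.asarray(f0_hz))
    return float(np.median(np.abs(np.diff(semitones))))

# Synthetic 10-ms-frame tracks: a sung note held at 220 Hz with slight
# jitter, versus a spoken vowel gliding smoothly from 180 down to 120 Hz.
rng = np.random.default_rng(0)
sung = 220 * 2 ** (rng.normal(0, 0.01, 100) / 12)  # ~0.01-semitone jitter
spoken = np.linspace(180, 120, 100)                # smooth downward glide

print(pitch_stability(sung))    # small: pitch held nearly constant
print(pitch_stability(spoken))  # larger: pitch varying continuously
```

A real system would of course need a pitch tracker in front of this, and the decision threshold would have to be tuned on actual speech and singing.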

Bruno also pointed to a fascinating example, singing, which made me think:
how can you tell the difference between someone talking and someone
singing? My informal observation is that you *can* do this pretty easily,
and I suspect that of the three cues, it's the *pitch* variation (or lack
of it) that's the most important factor (although the other two certainly
apply).

Two supporting observations. (1) When someone happens to hold a voiced
sound in speech *with a pitch stability that would be acceptable in music*,
it sounds out of place after a very short time. (I guess that with the
coupling between vocal-fold vibration rate and sub-glottal pressure, it
actually takes quite a lot of 'trimming' to keep pitch constant as a
syllable trails off.) Without having done the investigation, I believe
that if you look at the pitch tracks of even long filled pauses in speech,
then compare them to sung vowels, you'll find that pitch is held markedly
more constant in 'musical' voice sounds.

Observation (2) is that, when looking at speech spectrograms that
occasionally have music mixed in, it is often immediately obvious where the
music appears, owing to the very 'flat', time-extended, fixed-frequency
harmonics of the music, which show up as long horizontal striations. (I'm
thinking here of wideband spectrograms, where the harmonics of the speech
are in fact rarely visible at all.)

This suggests that you can look for music just by looking for the extended
(isolated, i.e. high-f0) harmonics that show an unnatural stability in
frequency. This is what Eric mentioned in his original reply, referring to
Mike Hawley's work.
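As a crude sketch of that striation-hunting idea (again my own illustration, with made-up data): given the dominant spectral-peak bin in each frame, count how long the peak sits still. The function name, tolerance, and the toy bin sequences are all assumptions:

```python
def longest_stable_run(peak_bins, tol=1):
    """Length of the longest run of frames whose dominant spectral peak
    stays within `tol` bins of the run's starting bin -- a crude proxy
    for the long horizontal harmonic striations of music."""
    best = run = 1
    start = peak_bins[0]
    for b in peak_bins[1:]:
        if abs(b - start) <= tol:
            run += 1
        else:
            run, start = 1, b
        best = max(best, run)
    return best

# Music-like: a harmonic parked in bin 40 for 50 frames.
music = [40] * 50
# Speech-like: the peak wanders with the formants from frame to frame.
speech = [40, 42, 45, 44, 41, 38, 36, 39, 43, 46] * 5

print(longest_stable_run(music))   # 50
print(longest_stable_run(speech))  # much shorter
```

A long enough stable run would then flag a region as probably containing music; the required run length is something you'd have to calibrate.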

Personally, I like the idea of having a multiple-pitch-sensitive (i.e.
polyphonic) model of pitch perception, and looking for music's
unnaturally-stable pitch-tracks in the output of that. That would also
give you a domain to spot the converse, the fluctuating pitch-tracks that
might form the bottom-up starting point of extracting and recognizing
speech.
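A minimal sketch of how that split might look downstream of such a model (the model itself is the hard part and is not shown; the function, threshold, and tracks here are all hypothetical):

```python
import numpy as np

def split_tracks(tracks, thresh=0.05):
    """Given several simultaneous f0 tracks (Hz per frame), as a
    polyphonic pitch model might emit, label each track 'music' if its
    median frame-to-frame log-f0 step (in semitones) falls below
    `thresh`, else 'speech'. The threshold is illustrative only."""
    labels = []
    for f0 in tracks:
        step = np.median(np.abs(np.diff(12 * np.log2(np.asarray(f0)))))
        labels.append('music' if step < thresh else 'speech')
    return labels

# Two concurrent tracks: a held note and a gliding voice.
held = [220.0] * 60
glide = np.linspace(200, 140, 60)
print(split_tracks([held, glide]))  # ['music', 'speech']
```

The stable tracks would feed the music hypothesis and the fluctuating ones would seed the bottom-up speech analysis, as described above.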

[All this to show that I'm not *opposed* to bottom-up mechanisms, it's just
that I've had a personal revelation as to the importance of the top-down
processes that act in combination with them, and I now feel the obligation
of a zealot to make sure that the expectation-based mechanisms are given
their due consideration in any debate. But Malcolm has that covered in
this instance ;-)]

DAn.